Meet OSGym: A New OS Infrastructure Framework That Manages 1,000+ Replicas at $0.23/Day for Computer Use Agent Research

Training AI agents that can actually use a computer — opening apps, clicking buttons, browsing the web, writing code — is one of the hardest infrastructure problems in modern AI. It’s not a data problem. It’s not a model problem. It’s a plumbing problem.

You need to spin up hundreds, potentially thousands, of full operating system environments with actual graphical user interfaces. Each one needs to run real software. Each one needs to handle unpredictable crashes. And you need all of them to run simultaneously at a cost that doesn’t bankrupt a university research lab.

That’s the problem OSGym, a new infrastructure framework from a team of researchers at MIT, UIUC, CMU, USC, UVA, and UC Berkeley, is designed to solve.

https://arxiv.org/pdf/2511.11672

What is a Computer Use Agent?

Before unpacking the infrastructure, it helps to understand what a computer use agent actually is. Unlike a chatbot that responds to text prompts, a computer use agent observes a screenshot of a desktop, decides what to do — click a button, type text, open a file — and executes that action through keyboard and mouse inputs. Think of it as an AI that can operate any software the way a human would.

Models like Anthropic’s Claude Computer Use and OpenAI’s Operator are early commercial examples. Research models like UI-TARS, Agent-S2, and CogAgent are pushing the boundaries further. But training any of these systems requires massive amounts of interaction data generated inside real OS environments — and that’s where things get expensive and complicated fast.

The Core Problem: OS Sandboxes at Scale

A coding environment or a web browser sandbox is relatively lightweight to run. A full OS sandbox with a GUI is not. Each virtual machine needs its own bootable disk (around 24 GB), its own CPU and RAM allocation, and its own display stack. Multiply that by hundreds or thousands of parallel instances and you have a resource consumption problem that typical academic compute budgets simply cannot absorb.

On top of resource costs, there’s the reliability problem. Software crashes. Browser sessions time out. Applications freeze. If your training pipeline doesn’t handle these failures gracefully, one bad VM can stall an entire training batch.

OSGym tackles both problems with four distinct architectural optimizations.

Decentralized OS State Management

The first design choice concerns how the system manages the state of each OS replica — tracking whether it’s healthy, what task it’s running, and how to recover it if something goes wrong.

A naive approach uses a single centralized manager for all replicas. This is a classic single point of failure: as replica count grows into the thousands, the central manager becomes overwhelmed, latency increases, and one crash can halt the whole system. OSGym instead gives every OS replica its own dedicated state manager. Each state manager exposes public methods modeled after the OpenAI Gym API — reset, step, and shutdown — but handles its own health monitoring and crash recovery internally. A failure in one replica cannot propagate to any other.
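A minimal sketch of this pattern may help make it concrete. The class and method internals below are illustrative assumptions, not OSGym's actual code — only the Gym-style `reset`/`step`/`shutdown` surface comes from the text:

```python
# Illustrative per-replica state manager (hypothetical internals; only
# the reset/step/shutdown interface is described in the source).
# Each replica owns its own manager, so a failure stays isolated.

class ReplicaStateManager:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.healthy = True

    def reset(self, task_config):
        """Restore this OS replica to a clean state for a new task."""
        self._ensure_healthy()
        return {"screenshot": None, "task": task_config}

    def step(self, action):
        """Execute one keyboard/mouse action; recover locally on failure."""
        try:
            return self._execute(action)
        except RuntimeError:
            self.healthy = False          # crash affects only this replica
            self._ensure_healthy()        # local recovery, no coordinator
            return self._execute(action)

    def shutdown(self):
        self.healthy = False

    def _ensure_healthy(self):
        if not self.healthy:
            # e.g. restart the container/VM backing this replica
            self.healthy = True

    def _execute(self, action):
        return {"screenshot": None, "done": False, "action": action}

# A fleet is just a list of independent managers — no central coordinator.
pool = [ReplicaStateManager(i) for i in range(4)]
obs = pool[0].reset({"app": "chrome"})
```

Because no shared coordinator sits in the loop, replica count can grow without a single manager becoming a latency or failure bottleneck.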

Hardware-Aware OS Replica Orchestration

Here’s a non-obvious insight this research surfaces: when you run many OS replicas on a single server, the bottleneck depends on how many replicas you pack per machine (call it K). At low K, the system is CPU-bound — most replicas are fighting over processor time. But as K grows, the bottleneck shifts to RAM — and RAM is dramatically cheaper than CPU.

A 32 GB DDR4 RAM module typically costs 10–20% of what a 16-core CPU costs. OSGym runs replicas as Docker containers (using Docker images from OSWorld as a foundation) rather than full Virtual Machines to reduce per-replica overhead. By choosing servers with higher RAM capacity and running more replicas per machine, the daily cost drops from around $300 for 128 replicas at K=1, to roughly $30 at K=64 — approximately $0.234 per replica per day, a number that fits comfortably within many academic grant budgets.
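The per-replica cost arithmetic above can be checked directly. The daily server costs here are the approximate figures quoted in the text, not exact cloud rates:

```python
# Back-of-envelope check of the per-replica cost figures quoted above.
replicas = 128

cost_k1 = 300.0   # ~$/day for 128 replicas at K=1 (CPU-bound packing)
cost_k64 = 30.0   # ~$/day for 128 replicas at K=64 (RAM-bound packing)

per_replica_k1 = cost_k1 / replicas
per_replica_k64 = cost_k64 / replicas

print(f"K=1:  ${per_replica_k1:.3f}/replica/day")   # → K=1:  $2.344/replica/day
print(f"K=64: ${per_replica_k64:.3f}/replica/day")  # → K=64: $0.234/replica/day
```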

KVM Virtualization with Copy-on-Write Disk Management

The disk provisioning problem is solved with a filesystem technique called reflink copy-on-write (CoW). Normally, spinning up 128 VM instances would mean duplicating a 24 GB base image 128 times — over 3 TB of storage and 30 seconds of provisioning time per VM.

OSGym instead uses cp --reflink=always on XFS-formatted NVMe drives. Each per-VM disk image shares physical disk blocks with the base image and only allocates new blocks when the VM actually writes to them. The result: 128 VMs consume 366 GB of physical disk instead of 3.1 TB — an 88% reduction — and disk provisioning time drops from 30 seconds to 0.8 seconds per VM, a 37× speedup. Each VM still sees its full 24 GB logical disk with near-native CPU performance.
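A provisioning step along these lines could be sketched as follows. The paths and base-image name are illustrative; note that `cp --reflink=always` fails fast (rather than silently doing a full 24 GB copy) on filesystems without reflink support, which is why XFS with reflink enabled matters here:

```python
# Sketch of reflink-based CoW disk provisioning (paths are hypothetical).
import subprocess

BASE_IMAGE = "/data/images/base.qcow2"   # illustrative base disk path

def provision_disk(vm_id, base=BASE_IMAGE, dest_dir="/data/vms"):
    """Build the cp command for a CoW clone of the base image."""
    dest = f"{dest_dir}/vm-{vm_id}.qcow2"
    # --reflink=always: share physical blocks with the base image;
    # new blocks are allocated only when this VM writes to them.
    return ["cp", "--reflink=always", base, dest]

def run_provision(vm_id):
    subprocess.run(provision_disk(vm_id), check=True)

print(provision_disk(7))
```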

Robust Container Pool with Multi-Layer Fault Recovery

OSGym maintains a pre-warmed runner pool — by default, 128 runners per executor node — initialized before training begins. Rather than creating and destroying VMs on demand, runners are recycled between tasks. Before each VM creation, OSGym reads /proc/meminfo and /proc/loadavg to verify the host can safely accommodate another instance, blocking creation if available memory falls below 10% or under 8 GB absolute. Each container is memory-limited to 6 GB to prevent over-provisioning under burst scenarios.
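An admission guard of this shape — using the 10%-of-total and 8 GB absolute thresholds from the paragraph above, with an illustrative parser — might look like:

```python
# Admission check before creating a new VM: block creation if available
# memory is below 10% of total or below 8 GB absolute (thresholds from
# the text; the parsing helper takes text so it can run without /proc).

def parse_meminfo(text):
    """Return {field: value_in_kB} from /proc/meminfo-style text."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest:
            info[key.strip()] = int(rest.split()[0])  # values are in kB
    return info

def can_create_vm(meminfo_text, min_fraction=0.10, min_abs_gb=8):
    info = parse_meminfo(meminfo_text)
    total_kb = info["MemTotal"]
    avail_kb = info["MemAvailable"]
    min_abs_kb = min_abs_gb * 1024 * 1024
    return avail_kb >= total_kb * min_fraction and avail_kb >= min_abs_kb

def host_can_create_vm():
    with open("/proc/meminfo") as f:
        return can_create_vm(f.read())

# ~263 GB total but only ~12 GB available: below the 10% floor → blocked.
sample = "MemTotal: 263000000 kB\nMemAvailable: 12000000 kB\n"
print(can_create_vm(sample))  # → False
```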

The system also tunes Linux kernel parameters that would otherwise cause silent failures at high concurrency — for example, fs.aio-max-nr is raised from 65,536 to 1,048,576, and fs.inotify.max_user_instances from 128 to 8,192. Fault recovery operates at two levels: at the step level, each action gets up to 10 retries by default; at the task level, if a runner fails permanently, the task is automatically reassigned to a fresh runner.
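The two recovery layers compose naturally. The sketch below takes only the retry count (10) and the reassignment behavior from the text; everything else is a hypothetical illustration:

```python
# Two-layer fault recovery sketch (illustrative, not OSGym's code):
# step-level retries within a runner, then task-level reassignment
# to a fresh runner if the current one fails permanently.

class RunnerDead(Exception):
    pass

def run_step(runner, action, max_retries=10):
    """Step level: retry a flaky action up to max_retries times."""
    for _ in range(max_retries):
        try:
            return runner.step(action)
        except RuntimeError:
            continue  # transient failure: retry on the same runner
    raise RunnerDead(f"step failed after {max_retries} retries")

def run_task(task_actions, runner_pool):
    """Task level: on permanent runner failure, restart the whole
    task on a fresh runner drawn from the pre-warmed pool."""
    while runner_pool:
        runner = runner_pool.pop()
        try:
            return [run_step(runner, a) for a in task_actions]
        except RunnerDead:
            continue  # this runner is gone; reassign to a fresh one
    raise RuntimeError("no healthy runners left")
```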

Unified Task Flow and Centralized Data Server

Two design elements are particularly important for developers integrating OSGym. First, every task follows a four-phase unified execution flow — Configure, Reset, Operate, Evaluate — regardless of which software or domain is involved. This standardization makes it straightforward to add new task types without changing the surrounding infrastructure.

Second, above the replica layer, a centralized data-server Python class exposes a single-entry batched interface (__next__ and async_step) that hides the complexity of state-manager communication and queuing. The batched step method is asynchronous, so the training loop is never blocked while waiting for OS replicas to complete their actions.
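As a sketch of what such a batched, non-blocking interface might look like — only the method names `__next__` and `async_step` come from the text; the internals and the fake replica are assumptions:

```python
# Illustrative batched, asynchronous data server (hypothetical
# internals; only the __next__ / async_step names are from the paper).
import asyncio

class DataServer:
    def __init__(self, managers):
        self.managers = managers  # one state manager per OS replica

    async def async_step(self, actions):
        """Fan one batch of actions out to all replicas concurrently;
        the caller awaits the whole batch, never one slow replica."""
        tasks = [m.step(a) for m, a in zip(self.managers, actions)]
        return await asyncio.gather(*tasks)

    def __next__(self):
        """Synchronous convenience wrapper around one batched step."""
        return asyncio.run(self.async_step([None] * len(self.managers)))

class FakeReplica:
    def __init__(self, rid):
        self.rid = rid

    async def step(self, action):
        await asyncio.sleep(0)  # stand-in for real OS interaction
        return {"replica": self.rid, "action": action}

server = DataServer([FakeReplica(i) for i in range(4)])
batch = asyncio.run(server.async_step(["noop"] * 4))
print(len(batch))  # → 4
```

The key property is that `asyncio.gather` overlaps all replicas' work, so one slow or recovering replica delays only its own slot in the batch result, not the dispatch of the others.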

What the Numbers Look Like in Practice

Using 1,024 parallel OS replicas, the system collected trajectories across ten task categories — including LibreOffice Writer, Calc, and Impress, Chrome, Thunderbird, VLC, VS Code, GIMP, OS system configuration, and multi-app workflows — at approximately 1,420 trajectories per minute; collecting the same data without parallelization would have taken an estimated 115,654 seconds (roughly 32 hours). The entire dataset cost $43 in cloud compute.

The research team then used that data to fine-tune Qwen2.5-VL 32B via supervised fine-tuning, followed by reinforcement learning using a PPO-based semi-online asynchronous pipeline (200 steps, batch size 64, learning rate 1e-6). The resulting model achieved a 56.3% success rate on the OSWorld-Verified benchmark — competitive with existing methods for a 32B parameter base model with no task-specific tuning.

Key Takeaways

Training computer use agents is an infrastructure problem first: Full OS sandboxes with GUIs are far heavier than coding or browser environments — each VM needs ~24 GB of disk, dedicated CPU and RAM, and a display stack. Without careful optimization, scaling to hundreds of replicas is simply unaffordable for most academic labs.

RAM is a smarter scaling lever than CPU: OSGym’s hardware-aware orchestration reveals that packing more replicas per server shifts the bottleneck from CPU to RAM — and RAM is 5–10× cheaper. This single insight cuts per-replica cost from ~$2.34/day to as low as $0.23/day.

Copy-on-write disk management eliminates the storage wall. By using XFS reflink CoW (cp --reflink=always), OSGym reduces physical disk consumption by 88% and speeds up VM disk provisioning by 37× — turning a 3.1 TB, 30-second-per-VM problem into a 366 GB, 0.8-second one.

Decentralized state management is the key to robustness at scale. Giving each OS replica its own dedicated state manager means failures stay isolated. Even starting from a fully crashed state, OSGym self-recovers all replicas within a short window — critical for uninterrupted long-running training jobs.

Academic-scale computer use agent research is now financially viable. With 1,024 replicas generating 1,420 trajectories per minute and a full dataset costing just $43 in cloud compute, OSGym brings the infrastructure cost of training general-purpose computer agents within reach of university research budgets.
