
Joerg Hiller
Feb 12, 2026 06:48
Together AI’s new CPD system separates warm and cold inference workloads, delivering 35-40% higher throughput for long-context AI applications on NVIDIA B200 GPUs.
Together AI has unveiled a cache-aware disaggregated inference architecture that boosts throughput by up to 40% for large language models handling long prompts—a development that could reshape economics for AI-native applications from coding copilots to retrieval-augmented systems.
The new system, called cache-aware prefill-decode disaggregation (CPD), tackles a growing bottleneck in AI infrastructure: serving prompts exceeding 100,000 tokens without crippling latency. As context windows expand across the industry, whether a prompt's context must be computed from scratch or can be reused from earlier requests has become a critical performance divider.
How CPD Works
The architecture splits inference into three distinct node types rather than the traditional two-tier prefill/decode split. Pre-prefill nodes handle “cold” requests: prompts with little reusable context that require full computation. Standard prefill nodes prioritize “warm” requests that can pull cached key-value states instead of recomputing them. Decode nodes remain isolated for latency-sensitive token generation.
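Together AI describes the split only in prose; as a rough sketch (the role names and pool layout below are hypothetical, not Together AI's API), the three pools might be modeled like this:

```python
from dataclasses import dataclass
from enum import Enum, auto


class NodeRole(Enum):
    """The three pools described above (hypothetical names)."""
    PRE_PREFILL = auto()  # "cold" prompts with little reusable context: full prefill compute
    PREFILL = auto()      # "warm" prompts: reuse cached KV state, compute only the new suffix
    DECODE = auto()       # latency-sensitive token generation, kept isolated from prefill work


@dataclass
class NodePool:
    role: NodeRole
    endpoints: list[str]  # addresses of the GPU nodes serving this role


# Example cluster layout: a two-way prefill split in front of an isolated decode pool.
cluster = [
    NodePool(NodeRole.PRE_PREFILL, ["node-a:8000"]),
    NodePool(NodeRole.PREFILL, ["node-b:8000"]),
    NodePool(NodeRole.DECODE, ["node-c:8000", "node-d:8000"]),
]
```

The key design point is that decode capacity is never shared with either prefill pool, so long cold prefills cannot stall token generation.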
A three-level KV-cache hierarchy underpins the system: GPU memory as the fastest tier, host DRAM in the middle, and a cluster-wide distributed cache connected via RDMA at the base. When a cold request is processed, its KV state is written to the distributed cache. Subsequent similar requests fetch this state in bulk, converting seconds of compute into hundreds of milliseconds of transfer.
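Together AI has not published the cache API, but the lookup-then-write-back flow can be sketched as below, assuming cached KV state is keyed by a hash of the token prefix it was computed from and that each tier exposes simple get/put calls (all names here are illustrative assumptions):

```python
import hashlib
from typing import Optional


class KVCacheHierarchy:
    """Three lookup tiers, fastest first: GPU memory, host DRAM, cluster-wide RDMA cache."""

    def __init__(self, gpu_tier, dram_tier, distributed_tier):
        # Each tier is assumed to expose get(key) -> Optional[bytes] and put(key, value).
        self.tiers = [gpu_tier, dram_tier, distributed_tier]

    @staticmethod
    def key_for_prefix(token_ids: list[int]) -> str:
        # Cached KV state is keyed by a hash of the token prefix it was computed from.
        return hashlib.sha256(repr(token_ids).encode()).hexdigest()

    def lookup(self, key: str) -> Optional[bytes]:
        """Return KV state from the fastest tier that holds it, promoting it into faster tiers."""
        for depth, tier in enumerate(self.tiers):
            state = tier.get(key)
            if state is not None:
                for faster in self.tiers[:depth]:
                    faster.put(key, state)  # promote so the next hit is cheaper
                return state
        return None  # full miss: the request is "cold" and needs a full prefill

    def write_back(self, key: str, state: bytes) -> None:
        """After a cold prefill, publish the KV state so later similar requests can reuse it."""
        for tier in self.tiers:
            tier.put(key, state)


# Usage with plain dicts standing in for the real tiers:
class DictTier(dict):
    def put(self, key, value):
        self[key] = value


cache = KVCacheHierarchy(DictTier(), DictTier(), DictTier())
k = cache.key_for_prefix([1, 2, 3])
cache.write_back(k, b"kv-state-bytes")
assert cache.lookup(k) == b"kv-state-bytes"
```

Promoting a hit from the distributed tier into DRAM and GPU memory is what turns repeated long prefixes into the fast transfers described above.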
The router makes real-time decisions by estimating how much of each incoming prompt can be served from cache, steering low-reuse requests to pre-prefill nodes while high-reuse traffic takes the fast path.
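Again as an illustrative sketch rather than Together AI's actual router (the prefix-matching helper and the 0.5 reuse threshold are assumptions), the decision reduces to estimating how much of the incoming prompt is already covered by cached KV state:

```python
def cached_prefix_length(prompt_tokens: list[int], cached_prefixes: list[list[int]]) -> int:
    """Length of the longest cached token prefix that matches the start of this prompt."""
    best = 0
    for prefix in cached_prefixes:
        if len(prefix) > best and prompt_tokens[:len(prefix)] == prefix:
            best = len(prefix)
    return best


def route(prompt_tokens: list[int], cached_prefixes: list[list[int]],
          warm_threshold: float = 0.5) -> str:
    """Send low-reuse ("cold") prompts to pre-prefill nodes, high-reuse ("warm") ones to prefill nodes.

    Token generation is not routed here: it always runs on the isolated decode pool.
    """
    reusable = cached_prefix_length(prompt_tokens, cached_prefixes)
    reuse_fraction = reusable / max(len(prompt_tokens), 1)
    return "prefill" if reuse_fraction >= warm_threshold else "pre_prefill"
```

A prompt that shares most of its tokens with a previously cached codebase context would take the fast path, while a brand-new prompt would be sent to the pre-prefill pool and its KV state written back for later reuse.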
Benchmark Results on B200 GPUs
Testing on NVIDIA B200 GPUs with tensor parallelism across four GPUs per node revealed stark differences from conventional approaches. The baseline system—two prefill nodes sharing capacity—saturated around 0.75-0.8 queries per second per GPU. CPD sustained approximately 1.1-1.15 QPS per GPU before hitting the same wall.
Latency improvements proved equally significant. Under increasing load, the baseline’s median time-to-first-token climbed past one second and into multi-second territory. CPD maintained sub-second to low-second median TTFT even at QPS levels where the baseline had already saturated.
The 600-second steady-state tests used a synthetic coding agent workload designed to mirror real AI-assisted development scenarios—large shared codebase context with multi-turn interactions.
Why This Matters for AI Infrastructure
The performance gap highlights a shift in what determines inference efficiency. Raw model execution speed matters less than system-level scheduling when context windows grow large. A 100K-token prompt that initially requires seconds of compute can drop to a few hundred milliseconds once its context warms in the cache hierarchy.
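The article gives no per-token figures, so the back-of-envelope arithmetic below uses assumed numbers (a grouped-query-attention model with 60 layers, 4 KV heads of dimension 128 in FP16, roughly 20K prefill tokens per second per node, and about 50 GB/s of effective transfer bandwidth; none of these are Together AI's published values) purely to illustrate the scale of the gap:

```python
# Illustrative arithmetic only; every constant below is an assumption, not a published figure.
PROMPT_TOKENS = 100_000
LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE = 60, 4, 128, 2  # FP16 K and V per layer
PREFILL_TOKENS_PER_S = 20_000   # assumed prefill throughput of one node
RDMA_BANDWIDTH_GB_S = 50        # assumed effective cluster-cache transfer rate

kv_bytes_per_token = 2 * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * LAYERS  # K + V across layers
kv_total_gb = PROMPT_TOKENS * kv_bytes_per_token / 1e9

cold_prefill_s = PROMPT_TOKENS / PREFILL_TOKENS_PER_S   # recompute everything
warm_fetch_s = kv_total_gb / RDMA_BANDWIDTH_GB_S        # fetch cached KV state instead

print(f"KV state:     {kv_total_gb:.1f} GB")   # ~12.3 GB
print(f"Cold prefill: {cold_prefill_s:.1f} s") # ~5.0 s of compute
print(f"Warm fetch:   {warm_fetch_s:.2f} s")   # ~0.25 s of transfer
```

Under those assumptions the warm path is roughly 20× cheaper, the same order of magnitude as the "seconds of compute into hundreds of milliseconds of transfer" framing above.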
For organizations running AI agents, multi-turn chatbots, or retrieval-augmented generation at scale, the throughput gains translate directly to infrastructure cost savings—or the ability to serve more users on existing hardware. The 35-40% improvement in sustainable QPS represents meaningful capacity that compounds across large deployments.
Together AI’s approach suggests that as foundation models continue expanding context capabilities, the infrastructure layer will need increasingly sophisticated workload separation to avoid letting expensive cold prompts dominate shared resources.
