Nvidia says it can shrink LLM memory 20x without changing model weights


Nvidia researchers have introduced a new technique that dramatically reduces how much memory large language models need to track conversation history — by as much as 20x — without modifying the model itself. The method, called KV Cache Transform Coding (KVTC), applies ideas from media compression formats like JPEG to shrink the key-value cache behind multi-turn AI systems, lowering GPU memory demands and speeding up time-to-first-token by up to 8x.

For enterprise AI applications that rely on agents and long contexts, this translates to reduced GPU memory costs, better prompt reuse, and up to an 8x reduction in latency by avoiding the need to recompute dropped KV cache values.

Serving large language models at scale requires managing a massive amount of data, especially for multi-turn conversations and long coding sessions. Every time a user adds to a prompt, the system relies on stored memory to avoid recomputing the entire conversation history from scratch.

However, this memory footprint grows rapidly, creating a severe bottleneck for latency and infrastructure costs.

Why KV cache becomes a bottleneck at scale

To power multi-turn AI applications like coding assistants or chat apps, large language models rely on a mechanism known as the key-value (KV) cache. This cache stores the hidden numerical representations for every previous token in a conversation. Because the model remembers the past conversation, it does not have to redundantly re-process the entire chat history each time the user submits a new prompt.

However, for AI applications with long context tasks, this cache can easily balloon to multiple gigabytes. As models scale up and generate increasingly long reasoning chains, the KV cache becomes a critical bottleneck for system throughput and latency.

This creates a difficult challenge for production environments. Because LLMs are highly memory-bound during inference, serving multiple users simultaneously is constrained by GPU memory exhaustion rather than computation time. “Effective KV cache management becomes critical, as idle caches must be quickly offloaded from GPU memory to accommodate other users, and quickly restored for resumed conversations,” Adrian Lancucki, Senior Deep Learning Engineer at Nvidia, told VentureBeat. “These infrastructure costs are now reflected in commercial pricing (e.g., as 'prompt caching') with additional charges for caching.” 

Even compromise solutions, like offloading the cache to lower-tier storage like CPU memory or SSDs, introduce significant data transfer overheads that can saturate network bandwidth and create bottlenecks.

One common solution is to compress the KV cache so that it takes up less memory. However, existing solutions often fall short of solving the problem holistically. Tools designed to compress caches for network transmission achieve low compression rates. Other compression methods require resource-intensive calculations on the fly for every single user prompt. Meanwhile, popular techniques like quantization or sparsification can introduce latency and accuracy drops or require making permanent changes to the model’s weights, which limits their practicality.

In their paper, the Nvidia researchers note that existing approaches “seldom exploit the strong low-rank structure of KV tensors.” This means that despite its huge number of dimensions and gigabytes of size, the actual underlying information in the KV cache is highly correlated and can be accurately represented using far fewer variables. Exploiting this characteristic is what KVTC focuses on.

Borrowing tricks from media codecs

At a high level, KVTC tackles the AI memory bottleneck by borrowing a proven concept from classical media: transform coding, the methodology that powers familiar image and video compression formats like JPEG. The framework shrinks the cache footprint through a fast, multi-step process that executes between inference phases to avoid slowing down the actual token generation. “This 'media compression' approach is advantageous for enterprise deployment because it is non-intrusive: it requires no changes to model weights or code and operates close to the transportation layer,” Lancucki said.

First, KVTC uses principal component analysis (PCA) to align the features of the KV cache data based on their importance. PCA is a statistical technique often used in machine learning to make models more efficient by isolating the most critical features of the data and stripping away redundancies. This part of the process is performed only once during an initial calibration phase for each model. Because the PCA alignment matrix is computed offline and reused, it does not slow down the compression process at inference time for individual user prompts.
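The calibration idea can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not Nvidia's implementation: the dimensions and the synthetic low-rank "KV" data are invented for the demo, and a real deployment would fit the basis on KV tensors collected from actual prompts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical calibration data: KV feature vectors (tokens x head_dim)
# collected offline. Real KV tensors have strong low-rank structure, so
# we simulate that by mixing a handful of latent factors plus noise.
d, n = 64, 4096
latent = rng.normal(size=(n, 8))
mixing = rng.normal(size=(8, d))
kv_calib = latent @ mixing + 0.01 * rng.normal(size=(n, d))

# One-time calibration: PCA via eigendecomposition of the covariance.
mean = kv_calib.mean(axis=0)
cov = np.cov(kv_calib - mean, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]      # sort components by variance
basis = eigvecs[:, order]              # reusable alignment matrix

# At inference time: rotate any new KV block into the decorrelated basis.
kv_new = rng.normal(size=(16, 8)) @ mixing
aligned = (kv_new - mean) @ basis

# Most of the energy now sits in the leading components.
energy = (aligned ** 2).sum(axis=0)
print(energy[:8].sum() / energy.sum())  # close to 1.0
```

Because `basis` is computed once and reused, the per-request cost is a single matrix multiply.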

Next, the system uses a dynamic programming algorithm to automatically budget how much memory each specific data dimension actually needs. The most critical principal components get high precision, while the trailing, less important components receive fewer bits or are assigned zero bits and dropped entirely.
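The budgeting step can be illustrated with a simple greedy allocator. This is a stand-in for the paper's dynamic program, built on the textbook assumption that each extra quantization bit roughly quarters a component's error; the variances below are invented.

```python
import heapq
import numpy as np

def allocate_bits(variances, total_bits, max_bits=8):
    """Greedily grant one bit at a time to whichever component's
    quantization error would drop the most."""
    bits = np.zeros(len(variances), dtype=int)
    # Max-heap of (negative error reduction, index): adding a bit to a
    # component with variance v and b bits shrinks error by ~3/4 * v / 4**b.
    heap = [(-0.75 * v, i) for i, v in enumerate(variances)]
    heapq.heapify(heap)
    for _ in range(total_bits):
        gain, i = heapq.heappop(heap)
        bits[i] += 1
        if bits[i] < max_bits:
            heapq.heappush(heap, (gain / 4.0, i))
    return bits

# Components sorted by importance: the leading ones get high precision,
# the trailing ones get zero bits and are dropped entirely.
variances = np.array([16.0, 4.0, 1.0, 0.25, 0.01, 0.001])
print(allocate_bits(variances, total_bits=12))  # → [5 4 2 1 0 0]
```

The result matches the behavior the researchers describe: precision concentrates where the variance is, and the tail is discarded outright.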

Finally, the pipeline takes this optimized, quantized data and packs it into a byte array, running it through an entropy coder called DEFLATE. Because this step is executed in parallel directly on the GPU using Nvidia’s nvCOMP library, it operates at very high speeds.
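The quantize-pack-deflate step can be sketched on the CPU with `zlib`, which emits the same DEFLATE format that nvCOMP accelerates on the GPU. The block shapes and the single-scale int8 quantizer are simplifications invented for the demo (KVTC assigns per-component precision, as described above).

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a PCA-aligned KV block: leading columns carry most of
# the signal, trailing components are near zero.
block = rng.normal(size=(256, 64)).astype(np.float32)
block[:, 8:] *= 0.01

# Uniform int8 quantization with one global scale, then entropy coding.
scale = float(np.abs(block).max()) / 127.0
q = np.round(block / scale).astype(np.int8)
packed = zlib.compress(q.tobytes(), level=6)
print(f"{block.nbytes} B -> {len(packed)} B "
      f"({block.nbytes / len(packed):.1f}x)")

# Decompression reverses the steps: inflate, cast, rescale.
restored = np.frombuffer(zlib.decompress(packed), dtype=np.int8)
restored = restored.astype(np.float32).reshape(block.shape) * scale
```

The near-zero trailing components quantize to long runs of zeros, which is exactly the kind of redundancy DEFLATE exploits.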

To decompress the data when the user returns, KVTC simply performs the computations in reverse. To speed up the process, it performs the heavy lifting of decompression in chunks, layer-by-layer. This allows the AI model to begin computing the next response early using the first decompressed chunk while the subsequent chunks are being decompressed in the background.
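The overlap between decompression and compute can be sketched with a two-worker pipeline. Everything here is a placeholder: `attend` stands in for the model's per-layer attention, the chunks are arbitrary zlib-packed bytes, and a real system would run the inflate stage on the GPU rather than in Python threads.

```python
import concurrent.futures as cf
import zlib

# Hypothetical per-layer compressed KV chunks.
layers = [zlib.compress(bytes([i]) * 4096) for i in range(8)]

def decompress(chunk: bytes) -> bytes:
    return zlib.decompress(chunk)

def attend(layer_idx: int, kv: bytes) -> int:
    # Placeholder for the model's attention pass over this layer's cache.
    return len(kv)

# Pipeline: kick off decompression of layer i+1 while computing layer i,
# so the model starts on the first chunk without waiting for the rest.
results = []
with cf.ThreadPoolExecutor(max_workers=2) as pool:
    future = pool.submit(decompress, layers[0])
    for i in range(len(layers)):
        kv = future.result()
        if i + 1 < len(layers):
            future = pool.submit(decompress, layers[i + 1])
        results.append(attend(i, kv))

print(results)  # every layer saw its full 4096-byte cache
```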

20x compression, less than 1% accuracy penalty

Nvidia researchers tested KVTC on a diverse roster of models ranging from 1.5B to 70B parameters, including the Llama 3 family, Mistral NeMo, and the reasoning-heavy R1-distilled Qwen 2.5 models. They evaluated these models on a variety of benchmarks, including complex math and coding challenges like MATH-500 and LiveCodeBench, as well as intensive long-context retrieval tasks like “Needle In A Haystack” and key-value retrieval.

They pitted KVTC against several popular baselines: token eviction methods (e.g., H2O and TOVA), heavy quantization techniques (e.g., KIVI and GEAR), and xKV (a prompt compression technique based on singular value decomposition).

At an effective 20x compression ratio, KVTC consistently stayed within one percentage point of the accuracy of the original, uncompressed models across most tasks. Even when researchers pushed the system to extreme limits of 32x and 64x compression, KVTC held its ground remarkably well.

By contrast, popular baselines like KIVI and GEAR began to suffer massive accuracy degradation at just a 5x compression ratio, particularly on long-context tasks. Standard cache eviction methods like H2O and TOVA proved entirely inadequate as generic compressors, effectively breaking down when asked to retrieve deep contextual information.

Consider the deployment of a smaller reasoning model like Qwen 2.5 1.5B for a coding assistant. Normally, this model requires 29 KB of memory for every single token. Using an 8x compression setting, KVTC shrank that footprint to roughly 3.2 KB per token, while suffering a negligible 0.3 percentage point drop in coding accuracy. 

For enterprise architects, deciding when to deploy this technique depends heavily on the use case. “KVTC is optimized for long-context, multi-turn scenarios,” Lancucki said. He pointed to coding assistants, iterative agentic reasoning workflows — particularly when waiting for high-latency tool outputs — and iterative RAG as ideal applications. “However, the users should skip KVTC for short conversations,” he added, because the uncompressed sliding window of the newest tokens dominates the sequence in shorter interactions, preventing meaningful compression ratios.

KVTC is highly portable and an optimized implementation will soon be integrated into the KV Block Manager (KVBM) within the Dynamo framework, making it compatible with popular open-source inference engines like vLLM. 

Most importantly for user experience, KVTC considerably reduces the time to first token (TTFT), the delay between sending a prompt and the model generating the first response token. On an 8,000-token prompt, a vanilla 12B model running on an Nvidia H100 GPU takes roughly 3 seconds to recompute the history from scratch. Meanwhile, a system can decompress the KVTC cache in just 380 milliseconds, delivering up to an 8x reduction in the time it takes to generate the first token.

Because KVTC does not alter how the model pays attention to tokens, it is theoretically compatible with token eviction methods like Dynamic Memory Sparsification (DMS), another advanced compression technique. DMS is an autoregressive token eviction method that optimizes memory by identifying and dropping the least important tokens from the context window entirely. 

“In principle, KVTC is complementary to DMS,” Lancucki stated. “While DMS evicts individual tokens along the time axis, KVTC compresses the data at each position separately.” However, he cautioned that while they target different dimensions, “it remains to be tested what compression ratios can be achieved with KVTC on sparsified caches.”

As models continue to scale natively to multi-million token context windows, the need for robust memory management will only grow. “Given the structural similarities and recurring patterns in KV caches across various model architectures, the emergence of a dedicated, standardized compression layer is probable,” Lancucki said. Supported by hardware advancements, AI infrastructure could soon treat KV cache compression as an invisible, standardized layer, much like video compression is to streaming today.
