Nvidia researchers unlock 4-bit LLM training that matches 8-bit performance

Researchers at Nvidia have developed a novel approach to train large language models (LLMs) in a 4-bit quantized format while maintaining stability and accuracy at the level of higher-precision models. Their technique, NVFP4, makes it possible to train models that not only outperform other leading 4-bit formats but also match the performance of the larger 8-bit FP8 format, all while using half the memory and a fraction of the compute.

The success of NVFP4 shows that enterprises can continue to cut inference costs by running leaner models that match the performance of larger ones. It also hints at a future where the cost of training LLMs will drop to a point where many more organizations can train their own bespoke models from scratch rather than just fine-tuning existing ones.

The quantization challenge

Model quantization is a technique used to reduce the computational and memory costs of running and training AI models. It works by converting the model's parameters, or weights, from high-precision formats like 16- and 32-bit floating point (BF16 and FP32) to lower-precision formats. The key challenge of quantization is to reduce the size of the model while preserving as much of its knowledge and capabilities as possible.
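
To make the idea concrete, here is a minimal, illustrative sketch (in Python with NumPy) of symmetric quantization onto a small signed-integer grid. It is a toy example of the general concept only, not Nvidia's format, and the function name is invented for illustration.

```python
import numpy as np

def quantize_dequantize(weights: np.ndarray, n_bits: int) -> np.ndarray:
    """Toy symmetric quantization: map float weights onto a small integer
    grid and back, losing precision along the way."""
    levels = 2 ** (n_bits - 1) - 1          # e.g. 7 representable magnitudes for 4 bits
    scale = np.abs(weights).max() / levels  # one scale factor for the whole tensor
    q = np.clip(np.round(weights / scale), -levels, levels)
    return q * scale                        # dequantized approximation of the originals

w = np.random.randn(8).astype(np.float32)
w_4bit = quantize_dequantize(w, n_bits=4)
print(np.abs(w - w_4bit).max())             # quantization error grows as bits shrink
```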

In recent years, 8-bit floating point formats (FP8) have become a popular industry standard, offering a good balance between performance and efficiency. They significantly lower the computational cost and memory demand for LLM training without a major drop in accuracy.

The next logical step is 4-bit floating point (FP4), which promises to halve memory usage again and further boost performance on advanced hardware. However, this transition has been challenging. Existing 4-bit formats, such as MXFP4, often struggle to maintain the same level of accuracy as their 8-bit counterparts, forcing a difficult trade-off between cost and performance.

How NVFP4 works

NVFP4 overcomes the stability and accuracy challenges of other FP4 techniques through a smarter design and a targeted training methodology. A key issue with 4-bit precision is its extremely limited range: It can only represent 16 distinct values. When converting from a high-precision format, outlier values can distort the scaling of an entire tensor, harming the model's accuracy. NVFP4 uses a more sophisticated, multi-level scaling approach that better handles these outliers, allowing for a "more precise and accurate representation of tensor values during training," according to Nvidia.
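
The effect of finer-grained scaling on outliers can be sketched with a toy comparison. The block size of 16 and the simple per-block scales below are illustrative assumptions, not the published NVFP4 scheme; the point is only that a single outlier degrades one small block rather than the whole tensor.

```python
import numpy as np

def quantize_per_tensor(x: np.ndarray, levels: int = 7) -> np.ndarray:
    # One scale for everything: a single outlier stretches the scale
    # and crushes all the small values toward zero.
    scale = np.abs(x).max() / levels
    return np.round(x / scale).clip(-levels, levels) * scale

def quantize_per_block(x: np.ndarray, block: int = 16, levels: int = 7) -> np.ndarray:
    # Each small block gets its own scale, so an outlier only
    # affects the block it lives in.
    out = np.empty_like(x)
    for i in range(0, len(x), block):
        out[i:i + block] = quantize_per_tensor(x[i:i + block], levels)
    return out

x = np.random.randn(64).astype(np.float32)
x[0] = 50.0  # a single outlier
print(np.abs(x[1:] - quantize_per_tensor(x)[1:]).mean())  # large error everywhere
print(np.abs(x[1:] - quantize_per_block(x)[1:]).mean())   # error confined to one block
```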

Beyond the format, the researchers introduce a 4-bit training recipe that achieves accuracy comparable to FP8. A central component is their “mixed-precision strategy.” Instead of converting the entire model to NVFP4, the majority of layers are quantized while a small fraction of numerically sensitive layers are kept in a higher-precision format like BF16. This preserves stability where it matters most. The methodology also adjusts how gradients are calculated during backpropagation — or the model's learning phase — to reduce biases that can accumulate from low-precision arithmetic.
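
One way to picture the mixed-precision policy described above is a simple per-layer precision lookup. The layer names and the specific layers kept in BF16 below are hypothetical, chosen only for illustration; they are not the recipe from the paper.

```python
# Illustrative policy: quantize most layers to 4-bit, but keep a small
# set of numerically sensitive layers in BF16. Layer names are made up.
HIGH_PRECISION_LAYERS = {"lm_head", "final_norm", "blocks.0.attn"}

def precision_for(layer_name: str) -> str:
    """Return the (assumed) numeric format to use for a given layer."""
    if any(layer_name.startswith(p) for p in HIGH_PRECISION_LAYERS):
        return "bf16"   # sensitive layers keep higher precision for stability
    return "nvfp4"      # the majority of layers are quantized to 4-bit

for name in ["blocks.0.attn", "blocks.7.mlp", "lm_head"]:
    print(name, "->", precision_for(name))
```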

NVFP4 in practice

To test their approach, the Nvidia team trained a powerful 12-billion-parameter hybrid Mamba-Transformer model on a massive 10 trillion tokens. They then compared its performance directly against a baseline model trained in the widely popular FP8 format. The results showed that the NVFP4 model's training loss and downstream task accuracy closely tracked the FP8 version throughout the entire process.

The performance held across a wide range of domains, including knowledge-intensive reasoning, mathematics and commonsense tasks, with only a slight drop-off in coding benchmarks in late training.

"This marks, to our knowledge, the first successful demonstration of training billion-parameter language models with 4-bit precision over a multi-trillion-token horizon, laying the foundation for faster and more efficient training of future frontier models,” the researchers write.

According to Shar Narasimhan, Nvidia's director of product for AI and data center GPUs, NVFP4’s 4-bit precision format in practice enables developers and businesses to train and deploy AI models with nearly the same accuracy as traditional 8-bit formats.

“By training model weights directly in 4-bit format while preserving accuracy, it empowers developers to experiment with new architectures, iterate faster and uncover insights without being bottlenecked by resource constraints,” he told VentureBeat. 

In contrast, FP8 (while already a leap forward from FP16) still imposes limits on model size and inference performance due to higher memory and bandwidth demands. “NVFP4 breaks that ceiling, offering equivalent quality with dramatically greater headroom for growth and experimentation,” Narasimhan said.

When compared to the alternative 4-bit format, MXFP4, the benefits of NVFP4 become even clearer. In an experiment with an 8-billion-parameter model, NVFP4 converged to a better loss score than MXFP4. To reach the same level of performance as the NVFP4 model, the MXFP4 model had to be trained on 36% more data, a considerable increase in training time and cost.

In addition to making pretraining more efficient, NVFP4 also redefines what’s possible. “Showing that 4-bit precision can preserve model quality at scale opens the door to a future where highly specialized models can be trained from scratch by mid-sized enterprises or startups, not just hyperscalers,” Narasimhan said, adding that, over time, we can expect a shift from developing general-purpose LLMs to “a diverse ecosystem of custom, high-performance models built by a broader range of innovators.”

Beyond pretraining

Although the paper focuses on the advantages of NVFP4 during pretraining, its impact extends to inference, as well. 

“Models trained on NVFP4 can not only deliver faster inference and higher throughput but shorten the time required for AI factories to achieve ROI — accelerating the cycle from model development to real-world deployment,” Narasimhan said. 

Because these models are smaller and more efficient, they unlock new possibilities for serving complex, high-quality responses in real time, even in token-intensive, agentic applications, without raising energy and compute costs. 

Narasimhan said he looks toward a future of model efficiency that isn’t solely about pushing precision lower, but building smarter systems.

“There are many opportunities to expand research into lower precisions as well as modifying architectures to address the components that increasingly dominate compute in large-scale models,” he said. “These areas are rich with opportunity, especially as we move toward agentic systems that demand high throughput, low latency and adaptive reasoning. NVFP4 proves that precision can be optimized without compromising quality, and it sets the stage for a new era of intelligent, efficient AI design.”


