NVIDIA Blackwell Shines in InferenceMAX™ v1 Benchmarks



Luisa Crawford
Oct 10, 2025 02:52

NVIDIA’s Blackwell architecture demonstrates significant performance and efficiency gains in SemiAnalysis’s InferenceMAX™ v1 benchmarks, setting new standards for AI hardware.

SemiAnalysis has introduced InferenceMAX™ v1, an open-source initiative aimed at comprehensively evaluating inference hardware performance. The recently published results show that NVIDIA’s latest GPUs, particularly the Blackwell series, lead in inference performance across a wide range of workloads, according to NVIDIA.

Performance Breakthroughs with NVIDIA Blackwell

NVIDIA Blackwell delivers a 15-fold performance improvement over the previous-generation Hopper architecture, which NVIDIA says translates into a correspondingly large revenue opportunity. This advance is largely attributed to NVIDIA’s hardware-software co-design, which includes support for the NVFP4 low-precision format, fifth-generation NVIDIA NVLink, and inference frameworks such as NVIDIA TensorRT-LLM and Dynamo.
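
To make the low-precision idea concrete, the sketch below fake-quantizes a tensor to the E2M1 value grid that 4-bit floating-point formats such as NVFP4 use, with one scale per 16-element block. This is a simplified illustration, not NVIDIA’s implementation: real NVFP4 stores FP8 block scales plus a per-tensor scale, while this version keeps scales in FP32.

```python
import numpy as np

# Non-negative values representable by an E2M1 4-bit float (the grid NVFP4 uses).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def fake_quantize_nvfp4(x, block_size=16):
    """Round x to the nearest E2M1 value, one FP32 scale per block.

    Simplified sketch: real NVFP4 stores FP8 (E4M3) block scales plus a
    per-tensor scale; here the scales stay in full precision.
    """
    x = np.asarray(x, dtype=np.float32)
    n = x.size
    padded = np.pad(x, (0, -n % block_size)).reshape(-1, block_size)

    # Scale each block so its largest magnitude maps to the top of the grid.
    scales = np.abs(padded).max(axis=1, keepdims=True) / E2M1_GRID[-1]
    scales[scales == 0] = 1.0
    scaled = padded / scales

    # Nearest-grid-point rounding of magnitudes; the sign is restored after.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    deq = np.sign(scaled) * E2M1_GRID[idx] * scales
    return deq.reshape(-1)[:n]

x = np.random.randn(1024).astype(np.float32)
print("mean abs quantization error:", np.abs(x - fake_quantize_nvfp4(x)).mean())
```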

The open-source nature of InferenceMAX v1 allows the AI community to replicate these results, providing a common baseline for performance validation across a variety of AI inference scenarios.

Key Features of InferenceMAX v1

InferenceMAX v1 distinguishes itself with continuous, automated testing, publishing results daily. These benchmarks encompass single-node and multi-node configurations, covering a wide range of models, precisions, and sequence lengths to reflect real-world deployment scenarios.
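
As a rough picture of what such a benchmark matrix looks like, the snippet below enumerates a hypothetical sweep; the model names, precisions, node counts, and sequence-length pairs are illustrative placeholders, not the actual InferenceMAX v1 configuration.

```python
from itertools import product

# Illustrative sweep axes; the real InferenceMAX v1 matrix differs.
models = ["gpt-oss-120b", "llama-3.3-70b"]
precisions = ["fp8", "nvfp4"]
node_counts = [1, 8]                          # single-node and multi-node runs
seq_lengths = [(1024, 1024), (8192, 1024)]    # (input tokens, output tokens)

configs = list(product(models, precisions, node_counts, seq_lengths))
print(f"{len(configs)} benchmark configurations")
for model, precision, nodes, (isl, osl) in configs[:3]:
    print(f"{model} | {precision} | {nodes} node(s) | ISL={isl} OSL={osl}")
```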

The benchmarks provide insights into latency, throughput, and batch size performance, crucial metrics for AI applications involving reasoning tasks, document processing, and chat scenarios.
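
The trade-off behind these metrics fits in a few lines of code: a larger batch raises aggregate throughput per GPU, but each decode step takes longer, so every individual user sees tokens more slowly. The step times below are made-up illustrations, not measured values.

```python
def serving_metrics(batch_size, step_time_s):
    """Aggregate vs. per-user token rates for one decode step.

    step_time_s is the time for one forward pass that emits one token
    per in-flight request; real step time grows with batch size.
    """
    throughput = batch_size / step_time_s   # tokens/s per GPU, all users combined
    interactivity = 1.0 / step_time_s       # tokens/s seen by each individual user
    return throughput, interactivity

# Made-up step times: larger batches amortize weight loads but step more slowly.
for batch, step in [(1, 0.010), (32, 0.020), (256, 0.060)]:
    tp, ia = serving_metrics(batch, step)
    print(f"batch={batch:4d}  throughput={tp:8.0f} tok/s/GPU  per-user={ia:5.1f} tok/s")
```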

NVIDIA’s Generational Leap

The leap from NVIDIA Hopper HGX H200 to the Blackwell DGX B200 and GB200 NVL72 platforms marks a significant increase in efficiency and cost-effectiveness. Blackwell’s architecture, featuring fifth-generation Tensor Cores and advanced NVLink bandwidth, offers superior compute-per-watt and memory bandwidth, lowering the cost per million tokens considerably.
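
Cost per million tokens follows directly from per-GPU throughput and an hourly GPU price, as in the hypothetical calculation below; both inputs are placeholders rather than measured Blackwell or Hopper figures.

```python
def cost_per_million_tokens(gpu_hourly_usd, tokens_per_sec_per_gpu):
    """Serving cost in USD per one million output tokens on one GPU."""
    tokens_per_hour = tokens_per_sec_per_gpu * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# Hypothetical numbers showing how higher throughput drives cost down,
# even when the faster GPU rents for more per hour.
print(cost_per_million_tokens(gpu_hourly_usd=3.0, tokens_per_sec_per_gpu=1_000))    # ~$0.83
print(cost_per_million_tokens(gpu_hourly_usd=5.0, tokens_per_sec_per_gpu=10_000))   # ~$0.14
```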

This architectural prowess is complemented by continuous software optimizations that enhance performance over time. Notably, improvements to the TensorRT-LLM stack have yielded substantial throughput gains on large language models such as gpt-oss-120b.

Cost Efficiency and Scalability

GB200 NVL72 sets a new standard in AI cost efficiency, offering significantly lower total cost of ownership compared to previous generations. It achieves this by delivering higher throughput and maintaining low costs per million tokens, even at high interactivity levels.
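
One way to see why higher throughput dominates total cost of ownership: amortized hardware cost and power draw are roughly fixed per hour, so every additional token served divides that fixed cost further. The sketch below uses a deliberately simplified TCO model with made-up inputs, not GB200 NVL72 figures.

```python
def tco_per_million_tokens(capex_usd, amortization_years, power_kw,
                           usd_per_kwh, tokens_per_sec):
    """Hourly TCO (amortized capex + power) spread over tokens served.

    A deliberately simplified model: it ignores cooling, networking,
    staffing, and any utilization below 100%.
    """
    hours = amortization_years * 365 * 24
    hourly_cost = capex_usd / hours + power_kw * usd_per_kwh
    return hourly_cost / (tokens_per_sec * 3600) * 1_000_000

# Hypothetical rack-scale inputs; prints roughly $0.05 per million tokens.
print(tco_per_million_tokens(capex_usd=3_000_000, amortization_years=4,
                             power_kw=120, usd_per_kwh=0.08,
                             tokens_per_sec=500_000))
```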

The design of GB200 NVL72, combined with Dynamo and TensorRT-LLM, maximizes the performance of Mixture of Experts (MoE) models, enabling efficient GPU utilization and high throughput under a range of service-level agreement (SLA) constraints.
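
Why MoE models reward this kind of system design: only a few experts run for each token, so per-token compute stays small even though the total weight footprint is large, which makes spreading experts across GPUs attractive. The minimal top-2 gating forward pass below illustrates the routing idea in NumPy; it is not the TensorRT-LLM or Dynamo implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

# One tiny linear "expert" per slot; in a real MoE these live on different GPUs.
W_gate = rng.standard_normal((d_model, n_experts)) * 0.02
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]

def moe_forward(x):
    """Route each token to its top-2 experts and mix their outputs."""
    logits = x @ W_gate                            # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of the top-2 experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        weights = np.exp(logits[t, top[t]])
        weights /= weights.sum()                   # softmax over the chosen two
        for w, e in zip(weights, top[t]):
            out[t] += w * (x[t] @ experts[e])      # only 2 of 8 experts execute
    return out

tokens = rng.standard_normal((4, d_model))
print(moe_forward(tokens).shape)  # (4, 64)
```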

Collaborative Advancements

NVIDIA’s collaboration with open source projects like SGLang and vLLM has further enhanced the performance and efficiency of Blackwell. These partnerships have led to the development of new kernels and optimizations, ensuring that NVIDIA’s hardware can fully leverage open source inference frameworks.

With these advancements, NVIDIA continues to push the boundaries of AI hardware and software, setting new benchmarks for performance and efficiency in the industry.
