AMD Instinct MI355X reaches NVIDIA B200 parity in MLPerf Inference v6.0

MI355X tied B200 on Llama 2 70B Offline, beat it by 19% on Interactive, and crossed one million tokens per second at cluster scale.

What's new

MLCommons published MLPerf Inference v6.0 results on April 1, 2026. AMD highlighted its Instinct MI355X submissions in a same-day blog post. On Llama 2 70B, the MI355X platform tied NVIDIA's B200 in the Offline scenario, reached 97% of B200 throughput in the Server scenario, and exceeded B200 by 19% in the Interactive scenario. On GPT-OSS-120B, MI355X delivered 111% of B200 Offline performance and 115% of B200 single-node Server performance. On the text-to-video workload Wan-2.2-t2v, MI355X reached 93% of B200 single-node performance and 87% of B300 single-node performance in Single Stream. AMD also reported clearing one million tokens per second at cluster scale on Llama 2 70B in both the Server and Offline scenarios, and on GPT-OSS-120B in Offline. The submissions ran on the ROCm software stack. Source: AMD, "AMD Delivers Breakthrough MLPerf Inference 6.0 Results."
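The relative figures above can be lined up in one place. The following is a minimal Python sketch, not anything AMD or MLCommons published: it restates each result as a percentage of the named NVIDIA baseline and turns it into a signed gap. The names RESULTS, delta_percent, and summarize are illustrative, and treating "tied" as exactly 100% is an assumption.

```python
# MI355X performance as a percentage of the named NVIDIA baseline,
# taken from the figures quoted above (baseline = 100).
# Assumption: "tied" is recorded as exactly 100.
RESULTS = [
    # (model, scenario, baseline, mi355x_percent_of_baseline)
    ("Llama 2 70B", "Offline", "B200", 100),
    ("Llama 2 70B", "Server", "B200", 97),
    ("Llama 2 70B", "Interactive", "B200", 119),
    ("GPT-OSS-120B", "Offline", "B200", 111),
    ("GPT-OSS-120B", "Server", "B200", 115),
    ("Wan-2.2-t2v", "Single Stream", "B200", 93),
    ("Wan-2.2-t2v", "Single Stream", "B300", 87),
]


def delta_percent(percent_of_baseline):
    """Signed gap to the baseline, in percent of baseline performance."""
    return percent_of_baseline - 100


def summarize(results):
    """Return one human-readable line per result."""
    lines = []
    for model, scenario, baseline, pct in results:
        gap = delta_percent(pct)
        if gap > 0:
            verdict = f"{gap}% ahead of"
        elif gap < 0:
            verdict = f"{-gap}% behind"
        else:
            verdict = "tied with"
        lines.append(f"{model} {scenario}: {verdict} {baseline}")
    return lines


if __name__ == "__main__":
    for line in summarize(RESULTS):
        print(line)
```

Read this way, MI355X is at or ahead of B200 on four of the five LLM results listed and behind on both text-to-video comparisons.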

Why it matters

This is the first MLPerf round in which an AMD accelerator has reached parity with NVIDIA's flagship Blackwell B200 across multiple major LLM workloads. The 119% Interactive figure on Llama 2 70B is the most notable, because the Interactive scenario stresses per-query latency, historically NVIDIA's strongest ground, rather than peak batched throughput. Crossing one million tokens per second at cluster scale matters because that is the operating point at which large inference deployments size their fleets. Together, the results narrow the practical gap between MI355X and B200 for production LLM serving on open-source software, the dominant pattern among customers not locked into NVIDIA's stack.

Caveats

The figures are AMD's own MLPerf submissions, peer-reviewed by MLCommons but not independently reproduced outside the benchmark process. AMD did not submit MI355X results on every workload, and both B200 and B300, NVIDIA's Blackwell Ultra, still lead on Single Stream text-to-video. v6.0 uses a different workload mix than v5.x, so direct comparisons with earlier rounds should be made with caution. AMD has published only end-to-end throughput for the runs, not model-FLOPS-utilization data. Software-stack maturity outside MLPerf, especially for custom kernels and TensorRT-LLM-equivalent paths, remains the binding constraint for many production deployments. Source: AMD blog, April 1, 2026.
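To make the model-FLOPS-utilization caveat concrete, here is a rough Python sketch of how MFU could be estimated from end-to-end throughput. It relies on the common approximation that a forward pass through a dense decoder-only transformer costs about 2 FLOPs per parameter per token, ignoring attention. The function name estimate_mfu is illustrative. The input values in the example are placeholders, not MI355X specifications or MLPerf results.

```python
def estimate_mfu(tokens_per_second, param_count, peak_flops_per_second):
    """Rough model-FLOPS utilization for dense-transformer inference.

    Assumes ~2 FLOPs per parameter per generated token (forward pass only,
    attention FLOPs ignored). For a mixture-of-experts model, pass the
    active parameter count per token rather than the total.
    """
    if peak_flops_per_second <= 0:
        raise ValueError("peak_flops_per_second must be positive")
    achieved_flops_per_second = 2.0 * param_count * tokens_per_second
    return achieved_flops_per_second / peak_flops_per_second


if __name__ == "__main__":
    # Placeholder inputs for illustration only; these are NOT measured
    # or vendor-published numbers.
    mfu = estimate_mfu(
        tokens_per_second=10_000,      # hypothetical per-accelerator throughput
        param_count=70e9,              # 70B-parameter dense model
        peak_flops_per_second=5e15,    # hypothetical peak at the serving precision
    )
    print(f"Estimated MFU: {mfu:.1%}")
```

The peak figure has to match the numeric precision actually used for serving, which is one reason throughput alone does not pin down utilization.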