Nvidia drops an AI bombshell with Blackwell


TECH TALK

IT’S BEEN TWO YEARS since Nvidia revealed its Hopper H100 GPU architecture, currently one of the most sought-after processors for AI workloads. In fact, it’s in such high demand that individual H100 accelerators can cost $30,000–$40,000. It has also been banned from export to China.


The mid-cycle H200 refresh has only just started shipping, which means it's time for Nvidia to reveal its post-Hopper data center GPU plans.

Meet Blackwell. As we mentioned on page 8, the work of its namesake, David Blackwell, has had an impact on the research and development of artificial intelligence. That's fitting, considering the new B200 GPU is set to power the next generation of massive AI supercomputers. Nvidia hasn't spilled all the beans, so we don't know the die size or number of processing units. However, we know that Blackwell has a combined 208 billion transistors, and will be built on TSMC's N4P 4nm node.

We say 'combined', because the Blackwell GPU is composed of two dies, linked together via a new Nvidia High Bandwidth Interface (NV-HBI). The maximum die size of a chip is around 858 mm², but anything above 800 mm² is effectively at the reticle size limit. Nvidia's Ampere GA100 chip was 826 mm², made on TSMC's N7 node. The Hopper H100 is an 814 mm² chip fabricated on TSMC's N4 node. TSMC N4P won't allow for substantially more transistors in a given area, so Nvidia's solution is to bind two chips together. The cost per Blackwell B200 GPU is more than twice that of Hopper H100.
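If you want to sanity-check why two dies were the only way forward, here's a quick back-of-envelope sketch. The H100 transistor count (80 billion) and the assumed ~800 mm² Blackwell die area are our own inputs, not figures from Nvidia:

```python
# Rough density math. Assumes H100's published 80-billion-transistor
# count and a Blackwell die area near the ~800 mm^2 reticle limit;
# neither figure comes from Nvidia's Blackwell announcement.
h100_transistors = 80e9        # single 814 mm^2 die, TSMC N4
h100_area_mm2 = 814

blackwell_transistors = 208e9  # combined total across both dies
per_die = blackwell_transistors / 2          # ~104 billion per die

assumed_die_area_mm2 = 800                    # assumption: near reticle limit
h100_density = h100_transistors / h100_area_mm2       # ~98 M/mm^2
blackwell_density = per_die / assumed_die_area_mm2    # ~130 M/mm^2

print(f"Per Blackwell die: {per_die / 1e9:.0f}B transistors")
print(f"H100 density:      {h100_density / 1e6:.0f} M/mm^2")
print(f"Blackwell density: {blackwell_density / 1e6:.0f} M/mm^2 (assumed area)")
```

Even at the reticle limit, a single die only fits around half of those 208 billion transistors, which is exactly why Nvidia glued two together.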

Each Blackwell die has four HBM3e stacks of memory, 24GB each with 1 TB/s of bandwidth. That's two fewer HBM stacks per die than Hopper, which allows more die area to focus on compute. The result is still 192GB of total memory and 8TB/s of bandwidth, over double the memory capacity and bandwidth of the highest-performance H100 solution.
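The totals fall straight out of the per-stack figures. In the sketch below, the H100 comparison numbers (80GB and roughly 3.35 TB/s for the SXM part) are our assumption, not from the article:

```python
# Memory totals implied by the per-stack figures above.
dies = 2
stacks_per_die = 4
gb_per_stack = 24
tbps_per_stack = 1.0

total_gb = dies * stacks_per_die * gb_per_stack      # 192 GB
total_tbps = dies * stacks_per_die * tbps_per_stack  # 8.0 TB/s

# Assumed H100 SXM figures for comparison (not stated in the article):
h100_gb, h100_tbps = 80, 3.35
print(total_gb, total_tbps)                          # 192 8.0
print(total_gb / h100_gb, total_tbps / h100_tbps)    # ~2.4x each
```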

[Image: Blackwell comprises two full-reticle-size dies linked together via a 10 TB/s NV-HBI link. © NVIDIA]

Nvidia also adds support for new FP4 and FP6 number formats, with its upgraded Transformer Engine helping developers leverage them. These are aimed mainly at inference workloads, and each B200 GPU can provide up to 20 petaflops of FP4 compute.
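To give a feel for how coarse a 4-bit float really is, here's a minimal sketch of FP4 quantization. Nvidia hasn't published its exact FP4 encoding, so we assume the common E2M1 layout (1 sign, 2 exponent, 1 mantissa bit), and the per-tensor scaling scheme here is purely illustrative:

```python
# Minimal FP4 quantization sketch, assuming the common E2M1 layout
# (1 sign, 2 exponent, 1 mantissa bit). Illustrative only; Nvidia's
# actual FP4 format and scaling are not detailed in the announcement.

# E2M1 code points decode to these magnitudes:
# subnormals 0 and 0.5, then normals 1, 1.5, 2, 3, 4, 6 (plus negatives).
FP4_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_GRID = sorted({s * v for v in FP4_E2M1_MAGNITUDES for s in (-1.0, 1.0)})

def quantize_fp4(x: float, scale: float = 1.0) -> float:
    """Round x/scale to the nearest representable FP4 value, then rescale."""
    y = x / scale
    nearest = min(FP4_GRID, key=lambda v: abs(v - y))
    return nearest * scale

# Inference-style usage: pick a per-tensor scale, then snap each weight
# onto the 16-value FP4 grid. Only ~16 distinct values survive.
weights = [0.12, -0.8, 2.7, 5.1]
scale = max(abs(w) for w in weights) / 6.0  # map the largest weight to +/-6
print([quantize_fp4(w, scale) for w in weights])
```

With only 16 representable values, FP4 trades precision for throughput, which is why it targets inference, where models tolerate coarser weights, rather than training.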