In lieu of the multi-day extravaganza that is normally Nvidia’s flagship GTC in San Jose, the company has been rolling out a series of talks and announcements online. Even the keynote has gone virtual, with Jensen’s popular and traditionally rambling talk being shifted to YouTube. To be honest, it’s easier to cover keynotes from a livestream in an office anyway, although I do miss the hands-on demos and socializing that go with the in-person conference.
In any case, this year’s event featured an impressive suite of announcements around Nvidia’s new Ampere architecture for both the data center and AI on the edge, beginning with the A100 Ampere-architecture GPU.
Nvidia A100: World’s Largest 7nm Chip Features 54 Billion Transistors
The A100, Nvidia’s first Ampere-based GPU, is also the world’s largest and most complex 7nm chip, featuring a staggering 54 billion transistors. Nvidia claims performance gains of up to 20x over previous Volta models. The A100 isn’t just for AI, as Nvidia believes it is an ideal GPGPU device for applications including data analytics, scientific computing, and cloud graphics. For lighter-weight tasks like inferencing, a single A100 can be partitioned into up to seven slices to run multiple workloads in parallel. Conversely, NVLink allows multiple A100s to be tightly coupled.
All the top cloud vendors have said they plan to support the A100, including Google, Amazon, Microsoft, and Baidu. Microsoft is already planning to push the envelope of its Turing Natural Language Generation by moving to A100s for training.
Innovative TF32 Aims to Optimize AI Performance
Along with the A100, Nvidia is rolling out a new floating-point format, TF32, for the A100’s Tensor cores. It is a hybrid of FP16 and FP32: it pairs FP32’s 8-bit exponent with FP16’s 10-bit mantissa, aiming to keep some of the performance benefits of moving to FP16 while preserving FP32’s numeric range. The A100’s new cores will also directly support FP64, making them increasingly useful for a variety of HPC applications. Along with the new data format, the A100 also supports sparse matrices, so that AI networks that contain many unimportant nodes can be represented more efficiently.
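To make the format concrete, here is a minimal Python sketch that emulates TF32 precision by clearing the low 13 mantissa bits of an FP32 value, leaving 1 sign bit, 8 exponent bits, and 10 mantissa bits. The function name `to_tf32` is made up for illustration, and plain truncation is an assumption here; the rounding the A100 hardware actually applies may differ.

```python
import struct

def to_tf32(x: float) -> float:
    """Approximate TF32 by truncating an FP32 value to 10 mantissa bits.

    Hypothetical helper for illustration only: truncation is an assumption,
    not a statement about how the Tensor cores round.
    """
    # Reinterpret the value as a 32-bit IEEE 754 single-precision pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # FP32 has 23 mantissa bits; keep the top 10 and clear the low 13.
    bits &= 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# The exponent keeps FP32's 8 bits, so large values survive
# (1e30 is far beyond FP16's maximum of 65504).
big = to_tf32(1e30)

# The mantissa has only 10 bits, so steps smaller than 2**-10
# relative to the value are lost.
kept = to_tf32(1.0 + 2**-10)
lost = to_tf32(1.0 + 2**-11)
```

In this sketch `kept` still differs from 1.0 while `lost` collapses to 1.0, which is the precision trade-off the format makes in exchange for FP32-like range.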
Nvidia DGX A100: 5 PetaFLOPS in a Single Node
Along with the A100, Nvidia announced its newest data center computer, the DGX A100, a major upgrade to its current DGX models. The first DGX A100 is already in use at the US Department of Energy’s Argonne National Lab to help with COVID-19 research. Each DGX A100 features 8 A100 GPUs, providing 156 TFLOPS of FP64 performance and 320GB of GPU memory. It’s priced starting at “only” (their words) $199,000. Mellanox interconnects allow for multiple GPU deployments, but a single DGX A100 can also be partitioned into up to 56 instances to allow for running a number of smaller workloads.
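The per-GPU numbers fall out of those system totals with a little arithmetic. The sketch below uses only figures quoted in this article; the variable names are illustrative, and it assumes the totals divide evenly across the eight GPUs.

```python
# Figures quoted in the article for one DGX A100.
GPUS_PER_DGX = 8
MAX_SLICES_PER_GPU = 7        # each A100 can be partitioned into up to seven slices
TOTAL_GPU_MEMORY_GB = 320
TOTAL_FP64_TFLOPS = 156

# Assumption: totals are split evenly across the GPUs.
max_instances = GPUS_PER_DGX * MAX_SLICES_PER_GPU
memory_per_gpu_gb = TOTAL_GPU_MEMORY_GB / GPUS_PER_DGX
fp64_per_gpu_tflops = TOTAL_FP64_TFLOPS / GPUS_PER_DGX

print(f"Max instances per DGX A100: {max_instances}")
print(f"Memory per GPU: {memory_per_gpu_gb:.0f} GB")
print(f"FP64 per GPU: {fp64_per_gpu_tflops} TFLOPS")
```

The 56-instance figure is simply eight GPUs times seven slices each.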
In addition to its own DGX A100, Nvidia expects a number of its traditional partners, including Atos, Supermicro, and Dell, to build the A100 into their own servers. To assist in that effort, Nvidia is also selling the HGX A100 data center accelerator.
Nvidia HGX A100 Hyperscale Data Center Accelerator
The HGX A100 includes the underlying building blocks of the DGX A100 supercomputer in a form factor suitable for cloud deployment. Nvidia makes some very impressive claims for the price-performance and power efficiency gains that its cloud partners can expect from moving to the new architecture. Specifically, with today’s DGX-1 systems, Nvidia says a typical cloud cluster includes 50 DGX-1 units for training and 600 CPUs for inference, costs $11 million, occupies 25 racks, and draws 630 kW of power. With Ampere and the DGX A100, Nvidia says only one kind of computer is needed, and far fewer of them: 5 DGX A100 units for both training and inference at a cost of $1 million, occupying 1 rack, and consuming only 28 kW of power.
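Put side by side, Nvidia’s own numbers imply the following reduction factors. This is a quick sketch using only the figures quoted above; the dictionary keys are made-up labels, and the comparison takes Nvidia’s claim that the two clusters do equivalent work at face value.

```python
# Nvidia's claimed figures for a "typical" cloud cluster, as quoted above.
dgx1_cluster = {"cost_million_usd": 11, "racks": 25, "power_kw": 630}
dgx_a100_cluster = {"cost_million_usd": 1, "racks": 1, "power_kw": 28}

# Reduction factor for each metric: old value divided by new value.
reduction = {
    metric: dgx1_cluster[metric] / dgx_a100_cluster[metric]
    for metric in dgx1_cluster
}

for metric, factor in reduction.items():
    print(f"{metric}: {factor:g}x lower with DGX A100")
```

By these claims, the Ampere cluster costs one-eleventh as much, fits in one twenty-fifth of the rack space, and draws 22.5 times less power.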
DGX A100 SuperPOD
Of course, if you’re a hyperscale compute center, you can never have enough processor power. So Nvidia has created a SuperPOD from 140 DGX A100 systems, 170 InfiniBand switches, 280 TB/s network fabric (using 15km of optical cable), and 4PB of flash storage. Nvidia claims that all that hardware delivers over 700 petaflops of AI performance and was built by Nvidia in under three weeks to use for its own internal research. If you have the space and the money, Nvidia has released the reference architecture for its SuperPOD, so you can build your own. Joel and I think it sounds like the makings of a great DIY article. It should be able to run his Deep Space Nine upscaling project in about a minute.
Nvidia Expands Its SaturnV Supercomputer
Nvidia has also greatly expanded its SaturnV supercomputer to take advantage of Ampere. SaturnV was composed of 1800 DGX-1 systems, but Nvidia has now added 4 DGX A100 SuperPODs, bringing SaturnV to a claimed total capacity of 4.6 exaflops. According to Nvidia, that makes it the fastest AI supercomputer in the world.
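The headline numbers roughly check out against each other. The sketch below is back-of-the-envelope arithmetic under stated assumptions: it treats each DGX A100 as exactly 5 petaflops of AI performance, each SuperPOD as 140 such systems, and attributes everything left over in the 4.6-exaflop total to the original 1800 DGX-1 systems. Nvidia has not published the breakdown this way.

```python
# Figures quoted in the article (AI performance, not FP64).
PFLOPS_PER_DGX_A100 = 5
DGX_A100_PER_SUPERPOD = 140
SUPERPODS_ADDED = 4
DGX1_SYSTEMS = 1800
SATURNV_TOTAL_PFLOPS = 4600   # 4.6 exaflops expressed in petaflops

# One SuperPOD: 140 systems at 5 petaflops each.
superpod_pflops = PFLOPS_PER_DGX_A100 * DGX_A100_PER_SUPERPOD

# Contribution of the four new SuperPODs.
ampere_pflops = superpod_pflops * SUPERPODS_ADDED

# Assumption: whatever remains comes from the older DGX-1 systems.
dgx1_pflops = SATURNV_TOTAL_PFLOPS - ampere_pflops
pflops_per_dgx1 = dgx1_pflops / DGX1_SYSTEMS

print(f"Per SuperPOD: {superpod_pflops} petaflops")
print(f"Four SuperPODs: {ampere_pflops / 1000} exaflops")
print(f"Implied per DGX-1: {pflops_per_dgx1} petaflops")
```

Under those assumptions, the four SuperPODs account for 2.8 exaflops of the total, and the remaining 1.8 exaflops works out to about one petaflop per DGX-1.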
Jetson EGX A100 Takes the A100 to the Edge
Ampere and the A100 aren’t confined to the data center. Nvidia also announced a high-powered, purpose-built GPU for edge computing. The Jetson EGX A100 is built around an A100, but also includes Mellanox CX6 DX high-performance connectivity that’s secured using a line-speed crypto engine. The GPU also includes support for encrypted models to help protect an OEM’s intellectual property. Updates to Nvidia’s Jetson-based toolkits for various industries (including Clara, Jarvis, Aerial, Isaac, and Metropolis) will help OEMs build robots, medical devices, and a variety of other high-end products using the EGX A100.