NVIDIA RTX 4090/4090 Ti with 18,432 FP32 Cores and 24 GB of GDDR6X Memory to Deliver 100 TFLOPs at 2.8 GHz [Report]

The specifications of NVIDIA’s “Lovelace” AD102 GPU die have finally been confirmed. The fully enabled die will contain an incredible 18,432 FP32 cores, a considerable increase over the 12,288 shaders of AMD’s Navi 31. As reported the other day, the RTX 4090 will ship with a few partially disabled GPCs, bringing the effective core count down to 16,128. If NVIDIA launches a 4090 Ti, that behemoth will take advantage of the full AD102 die and its 18,432 shaders.

According to Kopite7kimi, the AD102 die will be able to hit the 100 TFLOPs mark of single-precision performance with its cores running at 2.8 GHz. What he still isn’t sure about is the SM structure. NVIDIA has a habit of slightly restructuring its SMs (Compute Units) with each generation. This time around, the SM could be completely revamped, much like with Maxwell about eight years ago, or it could carry over largely unchanged.
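The 100 TFLOPs figure follows from the standard peak-throughput formula: each FP32 core can retire one fused multiply-add (two FLOPs) per clock. A minimal sketch of the arithmetic:

```python
def peak_fp32_tflops(cores: int, clock_ghz: float) -> float:
    # 2 FLOPs per core per clock (one fused multiply-add); result in TFLOPs.
    return cores * 2 * clock_ghz / 1000

# Full AD102 (18,432 cores) at 2.8 GHz:
print(round(peak_fp32_tflops(18432, 2.8), 1))  # 103.2
# Cut-down RTX 4090 configuration (16,128 cores):
print(round(peak_fp32_tflops(16128, 2.8), 1))  # 90.3
```

So the full die clears 100 TFLOPs with room to spare, while the rumored 4090 configuration lands at roughly 90 TFLOPs.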

I’ll recap what I shared a while ago about the Maxwell SM and Lovelace’s possible SM design:

With Maxwell, the warp schedulers and the resulting threads per SM per clock were quadrupled, resulting in a 35% performance gain per core. It looks like NVIDIA wants to pull another Maxwell, a generation known for its exceptional performance and power efficiency that completely crushed AMD’s rival Radeon offerings.


This would mean that the total number of cores per SM remains unchanged (128), but the resources accessible to each cluster increase drastically. More specifically, the number of concurrent threads would double from 128 to 256. It’s hard to say how much of a performance boost this will translate to, but there should be a sizable payoff. Unfortunately, this layout takes up a lot of die area, which is expensive on TSMC’s N4 node. So it’s hard to say whether Jensen’s team actually managed to do this or set it aside for future designs.
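To put numbers on that doubling: each SM partition has one warp scheduler dispatching one 32-thread warp per clock, so going from four to eight partitions doubles the threads issued per SM per clock. A quick sketch, assuming the leaked 8-partition layout (a rumor, not a confirmed spec):

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

def threads_issued_per_sm_clock(partitions: int) -> int:
    # Each partition's scheduler dispatches one full warp per clock.
    return partitions * WARP_SIZE

print(threads_issued_per_sm_clock(4))  # 128 (Ampere-style 4-partition SM)
print(threads_issued_per_sm_clock(8))  # 256 (rumored 8-partition Lovelace SM)
```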

[Image: Lovelace SM with 8 partitions]
[Image: Fermi vs. Kepler vs. Maxwell vs. Turing SM]

There is also a chance that Team Green decides to go with a coupled SM design, something already introduced with Hopper. In case you missed the Hopper whitepaper, here’s a little primer on thread block clusters and Distributed Shared Memory (DSM). To make scheduling on GPUs with more than 100 SMs more efficient, Hopper and Lovelace will group every two blocks of threads in a GPC into a cluster. The main purpose of thread block clusters is to improve multithreading and SM utilization. These clusters run concurrently on SMs in a GPC.

[Image: NVIDIA H100 Tensor Core GPU Architecture Overview]

Thanks to an SM-to-SM network connecting the thread blocks in a cluster, data can be shared between them efficiently. This will be one of the main scalability-enhancing features of Hopper and Lovelace, a key requirement when you increase the number of cores/ALUs by more than 50%.

GPU          | TU102       | GA102           | AD102        | AD103        | AD104
Architecture | Turing      | Ampere          | Ada Lovelace | Ada Lovelace | Ada Lovelace
Process      | TSMC 12nm   | Samsung 8nm LPP | TSMC 5nm     | TSMC 5nm     | TSMC 5nm
GPCs         | 6           | 7               | 12           | 7            | 5
TPCs         | 36          | 42              | 72           | 42           | 30
SMs          | 72          | 84              | 144          | 84           | 60
Shaders      | 4,608       | 10,752          | 18,432       | 10,752       | 7,680
FP32 perf    | 16.1 TFLOPs | 37.6 TFLOPs     | ~90 TFLOPs?  | ~50 TFLOPs   | ~35 TFLOPs
Memory       | 11 GB GDDR6 | 24 GB GDDR6X    | 24 GB GDDR6X | 16 GB GDDR6  | 16 GB GDDR6
L2 cache     | 6 MB        | 6 MB            | 96 MB        | 64 MB        | 48 MB
Bus width    | 384-bit     | 384-bit         | 384-bit      | 256-bit      | 192-bit
TGP          | 250 W       | 350 W           | 600 W?       | 350 W?       | 250 W?
Launch       | Sep 2018    | Sep 2020        | Aug-Sep 2022 | Q4 2022      | Q4 2022
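As a rough sanity check on the throughput row above, the FMA-based peak formula can back-solve the implied clocks. The ~1.75 GHz boost clock for TU102/GA102 and the 16,128-core, 2.8 GHz configuration behind the AD102 entry are my assumptions, not confirmed specs:

```python
def peak_fp32_tflops(cores: int, clock_ghz: float) -> float:
    # 2 FLOPs (one fused multiply-add) per core per clock; result in TFLOPs.
    return cores * 2 * clock_ghz / 1000

# The TU102 and GA102 rows line up with a ~1.75 GHz boost clock (assumed):
print(round(peak_fp32_tflops(4608, 1.75), 1))   # 16.1 (TU102)
print(round(peak_fp32_tflops(10752, 1.75), 1))  # 37.6 (GA102)
# The ~90 TFLOPs AD102 entry matches the cut-down 16,128-core
# RTX 4090 configuration at 2.8 GHz, not the full 18,432-core die:
print(round(peak_fp32_tflops(16128, 2.8), 1))   # 90.3
```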

These are the two potential ways NVIDIA can (almost) double the core count without crippling scaling or leaving some shaders underutilized. Of course, there’s always a chance that Jensen’s team will come up with something entirely new and unexpected.
