The specifications of NVIDIA’s “Lovelace” AD102 GPU die have finally been confirmed. The fully activated core will contain an incredible 18,432 FP32 cores, a considerable increase over the 12,288 shaders of AMD’s Navi 31. As reported the other day, the RTX 4090 will ship with a few partially disabled GPCs, bringing the effective core count down to 16,128. If NVIDIA plans to launch a 4090 Ti, this behemoth will take advantage of the full AD102 core and its 18,432 shaders.
According to Kopite7kimi, the AD102 die will be able to reach the 100 TFLOPs mark of single-precision performance with its cores running at 2.8 GHz. What he still isn’t sure about is the SM structure. NVIDIA has a habit of slightly restructuring its SM (Compute Unit) with each generation. This time around it could be completely revamped, much like with Maxwell about eight years ago, or not.
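A quick back-of-the-envelope check shows where the 100 TFLOPs figure comes from. This is a sketch using the rumored numbers above, assuming the usual peak-throughput formula of one FMA (2 FLOPs) per core per clock:

```python
# Peak FP32 throughput = cores x 2 ops/clock (FMA) x clock rate.
# Numbers below are the rumored specs, not confirmed figures.

def peak_fp32_tflops(cores: int, clock_ghz: float) -> float:
    """Peak single-precision throughput in TFLOPs, assuming one FMA (2 FLOPs) per core per clock."""
    return cores * 2 * clock_ghz / 1000  # cores * 2 FLOPs * GHz -> GFLOPs -> TFLOPs

full_ad102 = peak_fp32_tflops(18432, 2.8)  # full die
rtx_4090 = peak_fp32_tflops(16128, 2.8)    # cut-down die

print(f"Full AD102: {full_ad102:.1f} TFLOPs")  # ~103.2
print(f"RTX 4090:   {rtx_4090:.1f} TFLOPs")    # ~90.3
```

So the full die only crosses 100 TFLOPs at 2.8 GHz or higher; the 16,128-core configuration lands around 90 TFLOPs, matching the table further down.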
I’ll recap what I shared a while ago about the Maxwell SM and Lovelace’s eventual SM design:
With Maxwell, the warp schedulers and resulting threads per SM/clock were quadrupled, resulting in a 135% performance gain per core. It looks like NVIDIA wants to pull another Maxwell, a generation known for its exceptional performance and power efficiency that completely crushed AMD’s rival Radeon offerings.
This would mean that the total number of cores per SM would remain unchanged (128) but the resources accessible to each cluster would increase drastically. More specifically, the number of concurrent threads would double from 128 to 256. It’s hard to say how much of a performance boost this would translate into, but we’d definitely see a big payoff. Unfortunately, this layout takes up a lot of die space, which is already expensive on TSMC’s N4 node. So it’s hard to say whether Jensen’s team actually managed to do this or set it aside for future designs.
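The SM counts implied by these figures are easy to work out. A minimal sketch, assuming Ada keeps 128 FP32 cores per SM (the unconfirmed layout discussed above):

```python
# Hypothetical SM math, assuming Ada keeps 128 FP32 cores per SM (not confirmed).
CORES_PER_SM = 128

def sm_count(total_cores: int) -> int:
    """Number of SMs implied by a total shader count."""
    return total_cores // CORES_PER_SM

print(sm_count(18432))  # 144 SMs on the full AD102
print(sm_count(16128))  # 126 SMs on the rumored RTX 4090 configuration

# If concurrent threads per SM double from 128 to 256, the full die's
# per-clock thread capacity would double as well:
print(144 * 128, "->", 144 * 256)  # 18432 -> 36864
```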
There is also a chance that Team Green decides to go with a coupled SM design, something already introduced with Hopper. In case you missed the Hopper whitepaper, here’s a little primer on thread block clusters and Distributed Shared Memory (DSM). To make scheduling on GPUs with more than 100 SMs more efficient, Hopper and Lovelace will group every two thread blocks in a GPC into a cluster. The main purpose of thread block clusters is to improve multithreading and SM utilization. These clusters run concurrently on SMs within a GPC.
Thanks to an SM-to-SM network between the two thread blocks in a cluster, data can be efficiently shared between them. This will be one of the main scalability-enhancing features on Hopper and Lovelace, and a key requirement when you increase the number of cores/ALUs by more than 50%.
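To make the pairing idea concrete, here is a toy model of that scheme, plain Python rather than real CUDA (on actual hardware the feature is exposed through CUDA’s cluster launch and cooperative groups APIs). The class and function names below are illustrative inventions, not NVIDIA APIs:

```python
# Toy model of Hopper-style thread block clusters (illustration only, not CUDA).
# Every two thread blocks scheduled onto a GPC are paired into a cluster;
# blocks in the same cluster can read each other's shared memory (DSM).

from dataclasses import dataclass, field

@dataclass
class ThreadBlock:
    block_id: int
    shared_mem: dict = field(default_factory=dict)

@dataclass
class Cluster:
    blocks: list  # exactly two ThreadBlocks in this model

    def dsm_read(self, reader: ThreadBlock, key: str):
        """A block reads a value from its partner's shared memory over the SM-to-SM network."""
        partner = next(b for b in self.blocks if b is not reader)
        return partner.shared_mem.get(key)

def form_clusters(blocks):
    """Group consecutive pairs of thread blocks into clusters of two."""
    return [Cluster(blocks=[blocks[i], blocks[i + 1]]) for i in range(0, len(blocks) - 1, 2)]

# Block 1 writes a partial result; its cluster partner (block 0) reads it
# directly instead of going through global memory.
blocks = [ThreadBlock(i) for i in range(4)]
blocks[1].shared_mem["partial_sum"] = 42
clusters = form_clusters(blocks)
print(clusters[0].dsm_read(blocks[0], "partial_sum"))  # 42
```

The point of the design is exactly what the toy shows: the partner block’s data is reachable without a round trip through L2 or global memory.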
| GPU | Turing | Ampere | Ada Lovelace | Ada Lovelace | Ada Lovelace |
|---|---|---|---|---|---|
| Process | TSMC 12nm | Samsung 8nm LPP | TSMC 5nm | TSMC 5nm | TSMC 5nm |
| FP32 TFLOPs | 16.1 | 37.6 | ~90? | ~50 | ~35 |
| Memory | 11 GB GDDR6 | 24 GB GDDR6X | 24 GB GDDR6X | 16 GB GDDR6 | 16 GB GDDR6 |
| L2 cache | 6 MB | 6 MB | 96 MB | 64 MB | 48 MB |
| Bus width | 384-bit | 384-bit | 384-bit | 256-bit | 192-bit |
| Launch | September 2018 | September 2020 | August–September 2022 | Q4 2022 | Q4 2022 |
These are the two potential ways NVIDIA can (almost) double the core count without crippling scaling or leaving some shaders underutilized. Of course, there’s always a chance that Jensen’s team will come up with something entirely new and unexpected.