"128 * the-number-of-cores" of threads can make progress truly in parallel (at ...

xoranth · on Nov 10, 2023

So, in NVIDIA parlance, my Skylake laptop would have 128 "cuda cores"?

128 = 4 (physical cores) * 2 (hyperthreading) * 8 (AVX2 f32 lanes) * 2 (floating point ports per core)

reroute22 · on Nov 10, 2023

Sorta, yeah!

Also, your "128 cuda cores" of the Skylake variety run at higher frequencies and work off much bigger caches, so they are faster (for serial work)...

...until they are slower, because GPU's latency hiding mechanism (with occupancy) hides load latencies very well, while CPU just stalls the pipeline on every cache miss for ungodly amounts of time...

...until they are faster again when the shader program uses a lot of registers and GPU occupancy drops to the floor and latency hiding stops hiding that well.
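To put rough numbers on the register-pressure point, here is a minimal sketch of how registers per thread can cap the number of resident warps. The register file size (65,536 32-bit registers), the 64-warp limit, and the 32-thread warp are assumed, illustrative values for a hypothetical SM, and `resident_warps` is a made-up helper. Real hardware also allocates registers in granules and has other limits (shared memory, block count) that this ignores.

```python
def resident_warps(regs_per_thread, regfile_regs=65536, warp_size=32, max_warps=64):
    """Upper bound on warps resident on one SM, considering registers only.

    All default values are assumptions for a hypothetical SM.
    """
    regs_per_warp = regs_per_thread * warp_size
    return min(max_warps, regfile_regs // regs_per_warp)


for regs in (32, 64, 128, 255):
    warps = resident_warps(regs)
    print(f"{regs:3d} regs/thread -> {warps:2d} resident warps")
```

Under these assumptions, going from 32 to 128 registers per thread cuts the warps available for latency hiding by 4x.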

But core counts - yes, more or less.

xoranth · on Nov 10, 2023

Thank you!

> ...until they are slower, because GPU's latency hiding mechanism (with occupancy) hides load latencies very well, while CPU just stalls the pipeline on every cache miss for ungodly amounts of time...

Is the GPU latency hiding mechanism equivalent to SMT/Hyperthreading, but with more threads per physical core? Or is there more machinery?

Also, how akin are GPU "streaming multiprocessors"/cores to CPU ones at the microarchitectural level? Are they out-of-order? Do they do register renaming?

frogblast · on Nov 10, 2023

As you state, GPU latency hiding is basically equivalent to hyperthreading, just with more threads per core. For example, for a 'generic' modern GPU, you might have:

A "Core" (Apple's term) / "Compute Unit" (AMD) / "Streaming Multiprocessor" (Nvidia) / "Core" (CPU world). This is the basic unit that gets replicated to build smaller/larger GPUs/CPUs

* Each "Core/CU/SM" supports 32-64 waves/simdgroups/warps (AMD/Apple/Nvidia terminology), or typically 2 threads (CPU terminology for hyperthreading). I.e., this is the unit that has a program counter and is used to find other work to do when one thread is unavailable. (This got blurred on later Nvidia parts with Independent Thread Scheduling.)

* The instruction set typically has a 'vector width'. 4 for SSE/NEON, 8 for AVX, or typically 32 or 64 for GPUs (but can range from 4 to 128)

* Each Core/CU/SM can execute N vector instructions per cycle (2-4 is common in both CPUs and GPUs). For example, both Apple and Nvidia GPUs have 32-wide vectors and can execute 4 vectors of FP32 FMA/cycle. So 128 FPUs total, or 256 FLOPs/cycle (counting each FMA as two operations). Each of these FPUs is what Nvidia calls a "Core", which is why their core counts are so high.

In short, the terminology exchange rate is 1 "Apple GPU Core" == 128 "Nvidia GPU Cores", on equivalent GPUs.
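The exchange rate above is just a product, which can be sketched as follows. The function names and the 10-core GPU are hypothetical; the 32-wide vectors and 4 vector FMAs per cycle are the figures from the list above.

```python
def fp32_lanes_per_core(vector_width, vector_fmas_per_cycle):
    """FP32 FPUs in one Core/CU/SM; each is one "core" in Nvidia's counting."""
    return vector_width * vector_fmas_per_cycle


def flops_per_cycle(lanes):
    """An FMA is a multiply plus an add, so it counts as 2 FLOPs."""
    return 2 * lanes


lanes = fp32_lanes_per_core(vector_width=32, vector_fmas_per_cycle=4)
print(lanes)                   # FPUs per core
print(flops_per_cycle(lanes))  # FP32 FLOPs per cycle per core

# A hypothetical GPU marketed as "10 cores" in Apple's terms would be
# marketed as 10 * lanes "cores" in Nvidia's terms.
print(10 * lanes)
```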

reroute22 · on Nov 11, 2023

I'll leave your first question to the other comment here from frogblast, as I really battled with how to answer it well, given my limited knowledge and being elbow-deep in an analogy, after all. I got writer's block, and frogblast actually answered something :D

> how akin are GPU "streaming multiprocessors"/cores to CPU ones at the microarchitectural level?

I'd say, if you want to get a feel for it in a manner directly relevant to recent designs, then reading through [1], [2], subsequent conversation between the two, and documents they reference should scratch that curiosity itch well enough, from the looks of it.

If you want a much more rigorous conversation, I could recommend the GPU portion of one of the lectures from CMU: [3], it's quite great IMO. It focuses a little less on the contemporary design decisions that actually ship in tens of millions+ of products today, and strays into alternatives a bit. It's the trade-off.

> Are they out-of-order?

Short answer: no.

GPUs get an "out of order"-like effect by picking a different warp entirely and making progress there. That completely circumvents any register data dependencies, and thus any need to track them, and achieves a similar end objective in a drastically more area- and power-efficient manner than Tomasulo's algorithm would.
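A toy model of that warp-switching idea, as a rough sketch only: every instruction is treated as a load whose result the next instruction in the same warp needs, there is one issue slot per cycle, and the scheduler is a simple round-robin over ready warps. The 20-cycle latency, the instruction count, and the `utilization` function are all made-up assumptions; no real GPU scheduler is being described.

```python
def utilization(num_warps, load_latency=20, instrs_per_warp=100):
    """Fraction of cycles in which the in-order core issues an instruction."""
    ready_at = [0] * num_warps                  # cycle when each warp may issue again
    remaining = [instrs_per_warp] * num_warps   # instructions left per warp
    cycle = issued = 0
    last = -1                                   # last warp that issued (round-robin)
    while any(remaining):
        for i in range(num_warps):
            w = (last + 1 + i) % num_warps
            if remaining[w] and ready_at[w] <= cycle:
                # Each warp stays strictly in order: it waits for its own load
                # to return, so no dependency tracking or renaming is needed.
                remaining[w] -= 1
                ready_at[w] = cycle + load_latency
                issued += 1
                last = w
                break                           # one issue slot per cycle
        cycle += 1
    return issued / cycle


for warps in (1, 4, 10, 20, 32):
    print(f"{warps:2d} warps -> {utilization(warps):.0%} issue-slot utilization")
```

In this model a single warp leaves the issue slot idle most of the time, while having at least as many ready warps as the load latency in cycles keeps it busy every cycle.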

> Do they do register renaming?

Short answer: no.

[1] https://forums.macrumors.com/threads/3d-rendering-on-apple-s...

[2] https://forums.macrumors.com/threads/3d-rendering-on-apple-s...

[3] https://www.youtube.com/watch?v=U8K13P6loyk ("Lecture 15. GPUs, VLIW, Execution Models - Carnegie Mellon - Computer Architecture 2015 - Onur Mutlu")

Const-me · on Nov 10, 2023

Not quite. These floating-point EUs are shared between both threads of the physical core. I would rather say 64 CUDA cores.
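The two ways of counting can be sketched side by side. The factors are the ones from xoranth's comment; the function and parameter names are made up for illustration.

```python
def fp32_lanes(physical_cores, f32_lanes_per_vector, fp_ports_per_core, smt_threads=1):
    """Count FP32 lanes; smt_threads > 1 counts each hardware thread separately."""
    return physical_cores * smt_threads * f32_lanes_per_vector * fp_ports_per_core


# Counting per hardware thread, as in the original estimate.
per_thread = fp32_lanes(4, 8, 2, smt_threads=2)

# Counting physical execution units, since both hyperthreads share the same ports.
per_unit = fp32_lanes(4, 8, 2)

print(per_thread, per_unit)
```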