As saagarjha mentions, vectorization of loads and stores is important for memory bandwidth and can be done automatically after unrolling a loop. Another important compiler optimization that builds on loop unrolling is pre-fetching: for a loop whose iterations are independent and each perform loads followed by computation on the loaded values, the loads can be re-arranged so they are grouped ahead of the computations. The thread can then use ILP to keep issuing loads while previous ones are still in flight, as long as it has registers free to hold the results. Without unrolling, each iteration's computation stalls waiting for its load instruction to return data; with unrolling, the loads overlap and make much better use of memory bandwidth.
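To make that concrete, here's a minimal sketch (the kernel and names are mine, not from any particular codebase) of a strided sum written naively and then manually unrolled 4x so that a batch of independent loads is issued before the dependent arithmetic:

```cuda
// Naive version: every iteration is load -> stall on memory -> add.
__global__ void sum_naive(const float* x, float* out, int n) {
    float acc = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        acc += x[i];  // the add can't issue until this load returns
    atomicAdd(out, acc);
}

// Unrolled version: four independent loads are issued back-to-back,
// so they can all be in flight at once (bounded by register count),
// and only then does the dependent arithmetic consume them.
__global__ void sum_unrolled(const float* x, float* out, int n) {
    float acc = 0.0f;
    int i = threadIdx.x;
    for (; i + 3 * blockDim.x < n; i += 4 * blockDim.x) {
        float a = x[i];
        float b = x[i + blockDim.x];
        float c = x[i + 2 * blockDim.x];
        float d = x[i + 3 * blockDim.x];  // loads grouped up front
        acc += a + b + c + d;             // computation grouped after
    }
    for (; i < n; i += blockDim.x)        // tail for leftover elements
        acc += x[i];
    atomicAdd(out, acc);
}
```

This is what the compiler does for you when it unrolls automatically; writing it by hand is only needed when the compiler declines to (as in the FP16 case below).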
I describe a situation in my blog post where automatic unrolling and pre-fetching was no longer being applied after changing a kernel to use FP16, and how I re-applied the optimizations manually to regain performance: https://andrewkchan.dev/posts/yalm.html#section-3.5
BTW, this part is not entirely true:
> Every single instruction is a vector instruction that is executed on all threads in the CUDA block simultaneously (unless it is a no-op due to divergent branching).
It is true that at one point instructions were executed in SIMT lockstep across a warp: a fixed-size group of 32 threads (each mapped to one CUDA core) that subdivides a block and is the fundamental unit of execution on the hardware.
However, since Volta (2017), the execution model allows threads in a warp to make forward progress in any order, even in the absence of conditional code. From what I have seen, in practice threads still move forward in SIMT lockstep and only diverge into active/inactive subsets at branches. That said, there is no guarantee about when those subsets re-converge, and this is behavior the hardware exhibits for efficiency (https://stackoverflow.com/a/58122848/4151721) rather than something the published programming model promises, i.e. it's implementation-specific behavior that could change at any time.
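Here's a minimal sketch of what that means in practice (the kernel is illustrative, not from any real codebase): after a divergent branch you synchronize the warp explicitly rather than assuming it has re-converged, and the warp-level intrinsics take an explicit participation mask (the `*_sync` variants) instead of assuming lockstep:

```cuda
// Illustrative kernel: diverge within a warp, then cooperate across it.
__global__ void divergent_then_cooperate(int* data) {
    int lane = threadIdx.x % 32;
    int v = data[threadIdx.x];

    if (lane < 16) {
        v *= 2;        // half the warp takes this path
    } else {
        v += 1;        // the other half takes this one
    }

    // Pre-Volta code often assumed implicit re-convergence at this point.
    // Since Volta, make that assumption explicit:
    __syncwarp();

    // Warp shuffle reduction; the 0xffffffff mask names every lane as a
    // participant rather than relying on lockstep execution.
    for (int off = 16; off > 0; off /= 2)
        v += __shfl_down_sync(0xffffffffu, v, off);

    if (lane == 0)
        data[(threadIdx.x / 32) * 32] = v;  // lane 0 holds the warp sum
}
```

The legacy non-`_sync` intrinsics (`__shfl`, `__ballot`, etc.) were deprecated precisely because they baked the lockstep assumption in.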
Ok, where do you nerds hang out and why am I not there?? I'm loving this discussion, y'all seem to be a rather rare breed of dev though. Where is the community for whatever this sort of dev is called? i.e., Clojure has the Clojurians Slack, the Clojurians Zulip, we have an annual Clojure conference and a few spin-offs. Where do you guys hang out??
This stuff is really awesome and I would love to dig in more!
Here's an NVIDIA blog post which discusses pre-fetching and unrolling more generally: https://developer.nvidia.com/blog/boosting-application-perfo...