FWIW CUDA graphs are only really useful when you have a lot of kernels you want ...

ryao · on Dec 19, 2024

llama.cpp saw a 10% performance improvement from using CUDA graphs, so inference does benefit from it:

https://developer.nvidia.com/blog/optimizing-llama-cpp-ai-in...

saagarjha · on Dec 20, 2024

Yeah, I'm sure there are workloads where it helps, but there are also a lot where you'd think it would help and the driver just fails to cooperate.