FWIW CUDA graphs are only really useful when you have a lot of kernels you want to launch. Otherwise they seem to perform similarly to just launching all the kernels in a loop on the host.
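To make the trade-off concrete, here's a minimal sketch of the capture-and-replay pattern using the standard CUDA runtime API. The kernel, sizes, and iteration count are illustrative; this is the kind of many-small-launches workload where graphs pay off, since the whole sequence is replayed with one host call instead of one launch per kernel:

```cuda
#include <cuda_runtime.h>

// Illustrative kernel; any sequence of small kernels works the same way.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20, iters = 100;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture the launch sequence once, instead of paying host-side
    // launch overhead for every kernel on every iteration.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int k = 0; k < iters; ++k)
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d, 1.001f, n);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    // Replay all 100 captured launches with a single host-side call.
    cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}
```

With only a handful of launches, the instantiation cost can outweigh the savings, which matches the observation above that graphs mostly help when there are a lot of kernels to launch.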


llama.cpp saw a ~10% performance improvement from using CUDA graphs, so inference does benefit from them:

https://developer.nvidia.com/blog/optimizing-llama-cpp-ai-in...


Yeah, I'm sure there are workloads it helps, but there are also a lot where you'd expect it to help and the driver just fails to cooperate.

