FWIW, CUDA graphs are only really useful when you have a lot of kernels to launch. Otherwise they seem to perform about the same as just launching all the kernels in a loop on the host.
Yeah, I mean, I'm sure there are workloads where it helps, but there are also a lot where you'd think it would help and the driver will actually just fail to cooperate.
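If anyone wants to check this on their own hardware, here is a rough sketch of the comparison being discussed: the same batch of tiny kernels launched once from a plain host loop and once as a captured graph. Everything in it is an assumption for illustration (the `tiny_kernel` name, the `kKernels` count of 1000, the launch configuration), and it needs CUDA 11.4 or newer for `cudaGraphInstantiateWithFlags`. It is not a proper benchmark, just a starting point.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel: does almost no work, so launch overhead dominates.
__global__ void tiny_kernel(float *x) {
    if (blockIdx.x == 0 && threadIdx.x == 0) x[0] += 1.0f;
}

int main() {
    const int kKernels = 1000;  // assumed batch size; vary this to see the effect

    float *d_x = nullptr;
    CHECK(cudaMalloc(&d_x, sizeof(float)));
    CHECK(cudaMemset(d_x, 0, sizeof(float)));

    cudaStream_t stream;
    cudaEvent_t start, stop;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Warm-up launch so one-time initialization is not counted.
    tiny_kernel<<<1, 32, 0, stream>>>(d_x);
    CHECK(cudaGetLastError());
    CHECK(cudaStreamSynchronize(stream));

    // 1) Baseline: launch every kernel from a host loop.
    CHECK(cudaEventRecord(start, stream));
    for (int i = 0; i < kKernels; ++i) {
        tiny_kernel<<<1, 32, 0, stream>>>(d_x);
    }
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(stop, stream));
    CHECK(cudaEventSynchronize(stop));
    float loop_ms = 0.0f;
    CHECK(cudaEventElapsedTime(&loop_ms, start, stop));

    // 2) Capture the same launches into a graph and instantiate it once.
    cudaGraph_t graph;
    cudaGraphExec_t graph_exec;
    CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    for (int i = 0; i < kKernels; ++i) {
        tiny_kernel<<<1, 32, 0, stream>>>(d_x);
    }
    CHECK(cudaStreamEndCapture(stream, &graph));
    CHECK(cudaGraphInstantiateWithFlags(&graph_exec, graph, 0));

    // Untimed first launch, then time a single replay of the whole graph.
    CHECK(cudaGraphLaunch(graph_exec, stream));
    CHECK(cudaStreamSynchronize(stream));

    CHECK(cudaEventRecord(start, stream));
    CHECK(cudaGraphLaunch(graph_exec, stream));
    CHECK(cudaEventRecord(stop, stream));
    CHECK(cudaEventSynchronize(stop));
    float graph_ms = 0.0f;
    CHECK(cudaEventElapsedTime(&graph_ms, start, stop));

    printf("host loop:    %.3f ms for %d kernels\n", loop_ms, kKernels);
    printf("graph replay: %.3f ms for %d kernels\n", graph_ms, kKernels);

    CHECK(cudaGraphExecDestroy(graph_exec));
    CHECK(cudaGraphDestroy(graph));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d_x));
    return 0;
}
```

Build with something like `nvcc -O2 graph_vs_loop.cu -o graph_vs_loop` (the file name is arbitrary). Note that the capture and instantiate cost is left out of the timing, so the graph only pays off if you replay it enough times to amortize that setup. Try a few values of `kKernels` and a heavier kernel body before drawing conclusions for your own workload.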