I don't know how much of this is tool use these days, with the LLM "just" calling an image generation model plus a bunch of prompt reformulation for the text-to-image model, which is most likely a "steerable" diffusion model (there are really nice talks by Stefano Ermon on YouTube!).
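To make that tool-use pattern concrete, here is a minimal sketch. The function names (`chat_llm`, `text_to_image`) are hypothetical placeholders for whatever chat LLM and diffusion backend are actually wired up; the point is just the two-step flow of reformulating the prompt and then handing it to the image model.

```python
# Hypothetical stubs: chat_llm stands in for a chat completion call,
# text_to_image for a text-to-image diffusion pipeline.

def chat_llm(messages: list[dict]) -> str:
    """Placeholder for a chat LLM call."""
    raise NotImplementedError

def text_to_image(prompt: str):
    """Placeholder for a text-to-image diffusion model."""
    raise NotImplementedError

def handle_image_request(user_msg: str):
    # Step 1: the LLM rewrites the user's request into a detailed
    # text-to-image prompt -- the "prompt reformulation" step.
    rewritten = chat_llm([
        {"role": "system", "content": "Rewrite the request as a detailed image prompt."},
        {"role": "user", "content": user_msg},
    ])
    # Step 2: hand the reformulated prompt to the diffusion model as a tool call.
    return text_to_image(rewritten)
```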
Actually, multimodal models usually have a vision encoder submodel that translates image patches into tokens, and then the pretrained LLM and vision encoder are jointly finetuned. I think reading the technical reports on Gemma or Kimi-VL will give a good idea here.
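A rough PyTorch sketch of the "image patches become tokens" idea: patches are linearly projected into the LLM's embedding space and prepended to the text token embeddings. Real VLMs (see the Gemma or Kimi-VL reports) use a full ViT encoder plus a learned projector rather than a single linear layer, but the data flow is the same; all shapes and sizes below are made up for illustration.

```python
import torch
import torch.nn as nn

class PatchToTokens(nn.Module):
    """Toy vision encoder: split an image into patches, project each to a token embedding."""
    def __init__(self, patch_size=16, d_model=768):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Linear(3 * patch_size * patch_size, d_model)

    def forward(self, images):  # images: (B, 3, H, W)
        p = self.patch_size
        patches = images.unfold(2, p, p).unfold(3, p, p)        # (B, 3, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).flatten(3)  # (B, H/p, W/p, 3*p*p)
        patches = patches.flatten(1, 2)                         # (B, N_patches, 3*p*p)
        return self.proj(patches)                               # (B, N_patches, d_model)

B, d_model = 2, 768
vision = PatchToTokens(d_model=d_model)
text_emb = nn.Embedding(32000, d_model)  # stand-in for the LLM's own embedding table

image_tokens = vision(torch.randn(B, 3, 224, 224))        # (B, 196, 768)
text_tokens = text_emb(torch.randint(0, 32000, (B, 16)))  # (B, 16, 768)
# The joint sequence of image and text tokens is what the LLM is finetuned on.
sequence = torch.cat([image_tokens, text_tokens], dim=1)
print(sequence.shape)  # torch.Size([2, 212, 768])
```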