I'm glad that we're making progress towards a deeper understanding of what LLMs are inherently good at and what they're inherently bad at (not to say incapable of doing, but stuff that is less likely to work due to fundamental limitations).
There's similarity here with, for example, defining the architecture of software, but letting an LLM write the functions. Or asking an LLM to write you the SQL query for your data analysis, rather than asking it to do your data analysis for you.
What I'd really like to see is a better-defined taxonomy of work, and studies on which bits work well with LLMs and which don't. I understand some of this intuitively, but I'm still building my intuition, and I see people tripping up on this all the time.
How is it that LLMs aren’t good at rendering the sequence of numbers but can reliably put the supplied pieces all in the right order?
Because the image generation is powered by a diffusion model that is only guided by the transformer model, and it still has a somewhat vague spatial representation, especially when it comes to coupling things like counting and complex positioning.
But if you have the LLM generate code, such as the markup an SVG graphic is made of, and then feed a rasterized image of that SVG to the diffusion model, that image takes the place of the raw noise input and guides the denoising process so the numerical parts land in the right spots.
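To make the "generate code first" step concrete, here is a minimal sketch of the kind of SVG skeleton an LLM could be asked to write: a grid of numbered cells in reading order. The function name `numbered_grid_svg`, the grid layout, and the cell size are all made up for illustration and are not from the original post. Rasterizing is left as a comment, since it needs a third-party tool.

```python
from xml.sax.saxutils import escape


def numbered_grid_svg(labels, cols=4, cell=120):
    """Return SVG markup that puts each label in its own cell, in reading order.

    Hypothetical helper for illustration: the layout is exact because it is
    computed, not drawn by an image model.
    """
    labels = list(labels)
    rows = -(-len(labels) // cols)  # ceiling division
    width, height = cols * cell, rows * cell
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" '
        f'height="{height}" viewBox="0 0 {width} {height}">',
        f'<rect width="{width}" height="{height}" fill="white"/>',
    ]
    for i, label in enumerate(labels):
        x, y = (i % cols) * cell, (i // cols) * cell
        # Cell outline, inset by 8px so neighbouring boxes do not touch.
        parts.append(
            f'<rect x="{x + 8}" y="{y + 8}" width="{cell - 16}" '
            f'height="{cell - 16}" fill="none" stroke="black" stroke-width="3"/>'
        )
        # Label centred in the cell.
        parts.append(
            f'<text x="{x + cell // 2}" y="{y + cell // 2}" '
            f'font-size="{cell // 3}" text-anchor="middle" '
            f'dominant-baseline="central">{escape(str(label))}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


svg = numbered_grid_svg(range(1, 13))

# Next step (not run here): rasterize the SVG to a PNG with any SVG renderer,
# for example CairoSVG's cairosvg.svg2png(bytestring=svg.encode(), write_to="skeleton.png"),
# then attach that PNG alongside the text prompt.
```

The point is only that the counting and positioning happen in code, where they are trivially correct, so the image model is left with the styling.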
I found a simple technique to get reliable text and numbers in AI-generated images.
I’m surprised the image models aren’t already doing this, so I wanted to share since I’m finding this so useful.
TLDR: use SVG to outline the image correctly first, then send that image with your text prompt to get Gemini 3.0 Pro to render it with correct numbers and text
This hack definitely falls in the “duh, why didn’t I think of that” category of tricks, but glad to now have it next time imagegen comes up short
I've been doing charts for slides like this for a while. Noticed HTML viz was super reliable, and that I could then style it with a diffusion model. It's very useful for data viz.
tldr: do a standard img2img workflow where you lay out a skeleton or low-res version, and then turn it into the final high-quality photorealistic version, instead of trying to zero-shot it purely from a text prompt.
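For anyone unfamiliar with img2img, here is a rough sketch of why the skeleton survives. It assumes a pipeline that exposes a `strength` knob the way common open-source img2img pipelines do; a hosted model like the one mentioned in this thread may work quite differently. The function name `img2img_steps` is hypothetical.

```python
def img2img_steps(num_steps, strength):
    """Return the indices of the denoising steps that actually run.

    Assumed convention: the input image is noised part of the way, and only
    the last `strength` fraction of the schedule is denoised. strength=1.0
    behaves like starting from pure noise; lower values stay closer to the
    skeleton image.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    steps_to_run = min(int(num_steps * strength), num_steps)
    start = num_steps - steps_to_run
    return list(range(start, num_steps))


half = img2img_steps(50, 0.5)   # skips the first 25 steps, runs steps 25..49
full = img2img_steps(50, 1.0)   # runs all 50 steps, skeleton is fully overwritten
```

With a lower strength the early steps, which are the ones that decide coarse layout, never run, so the positions and counts from the skeleton carry through and the model mostly fills in texture and style.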