
Absolutely fabulous work.

Ludicrously unnecessary nitpick for "Remove all the brown pieces of candy from the glass bowl":

> Gemini 2.5 Flash - 18 attempts - No matter what we tried, Gemini 2.5 Flash always seemed to just generate an entirely new assortment of candies rather than just removing the brown ones.

The way I read the prompt, it demands that the candies should change arrangement. You didn't say "change the brown candies to a different color", you said "remove them". You can infer from the few brown ones that you can see that there are even more underneath - surely if you removed them all (even just by magically disappearing them) then the others would tumble down into a new location? The level of the candies is lower than before you started, which is what you'd expect if you remove some. Maybe it's just coincidence, but maybe this really was its reasoning. (It did unnecessarily remove the red candy from the hand though.)

I don't think any of the "passes" did as well as this, including Gemini 3.0 Pro Image. Qwen-Image-Edit did at least literally remove one of the three visible brown candies, but just recolored the other two.





That is a great point! Since these multimodal models are moving towards better "world models", you could reasonably argue that if the directive was to physically remove the candy, then gravity/physics could affect the positioning of the other objects in the process.

You will note that the Minimum Passing Criteria allows a color change to pass the prompt, but with the rapid improvements in generative models, I may revise this test to be stricter, so that only "Removal" counts as a pass and a simple color swap does not.



