oofbey a day ago

This is really pretty cool. LLMs are so bad at images that it just makes sense to use reasoning to improve them. I'd love to see this applied to a bigger model than 3B, since this task isn't especially difficult. But the attention visualization really demonstrates that it's doing what it's supposed to.

  • skumar17 a day ago

    Thanks! I really love the visualization too. We have a hosted demo you can try as well!

    https://huggingface.co/spaces/Groundlight/grpo-vlm-decoder

    • oofbey a day ago

      Fun! I wish the demo had the attention visualization. Would that be easy to add? Is the source code for the HF demo in the repo too?

      • skumar17 a day ago

        Unfortunately it might be a bit challenging, since there’s a nontrivial amount of extra computation we do for the viz, but it’s probably possible?

        • skumar17 a day ago

          The attention demo code is in the /attention_demo directory if you want to try it on your own messages too :)
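          For anyone curious, the general shape is roughly this (a sketch using
          plain transformers, not the exact code in /attention_demo; the
          checkpoint name and prompt here are just placeholders):

            import torch
            from PIL import Image
            from transformers import AutoProcessor, AutoModelForVision2Seq

            model_id = "Qwen/Qwen2-VL-2B-Instruct"  # placeholder, not necessarily the demo's model
            processor = AutoProcessor.from_pretrained(model_id)
            model = AutoModelForVision2Seq.from_pretrained(
                model_id,
                torch_dtype=torch.float16,
                attn_implementation="eager",  # so attention weights are actually returned
            )

            messages = [{"role": "user", "content": [
                {"type": "image"},
                {"type": "text", "text": "Decode the text in this image."},
            ]}]
            prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
            inputs = processor(text=[prompt], images=[Image.open("noisy_text.png")],
                               return_tensors="pt")

            with torch.no_grad():
                out = model.generate(
                    **inputs,
                    max_new_tokens=64,
                    output_attentions=True,        # the extra computation mentioned above
                    return_dict_in_generate=True,  # so out.attentions gets populated
                )

            # out.attentions holds one tuple per generated token; each entry has
            # per-layer tensors of shape [batch, heads, query_len, key_len].
            # Averaging over heads and slicing the columns that correspond to the
            # image tokens gives a rough "where the model looked" map per token.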

  • xoofoog a day ago

    What do you mean LLMs are bad at images? GPT or Claude can read text perfectly, and describe what's in a picture in a lot of detail. I feel like replacing OCR is one of the few things you can actually trust them for.

    • skumar17 a day ago

      That’s a good observation. For this project, I found that while the base model could “read” the image, it didn’t really understand how to use it. GRPO allowed it to effectively search the solution space.
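      To make that concrete: GRPO samples a group of decodes per image and scores
      each one relative to its siblings, so the model gets a learning signal
      without a separate value model. A toy sketch of the group-relative
      advantage (not our actual training loop; the reward numbers are made up):

        import torch

        def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
            # rewards: [group_size] scores for completions sampled from one prompt.
            # Normalizing within the group pushes better-than-average decodes up
            # and worse-than-average decodes down.
            return (rewards - rewards.mean()) / (rewards.std() + eps)

        # e.g. 4 decodes of the same noisy image, scored against the ground truth
        rewards = torch.tensor([0.1, 0.9, 0.4, 0.1])
        print(grpo_advantages(rewards))  # best decode gets the largest positive advantage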

    • oofbey a day ago

      That's true - they are quite good at OCR. But they're really bad at a bunch of tasks that seem like they should be super simple. Like "are these lines crossed" or "which letter is circled". See https://vlmsareblind.github.io/ for some clear examples.