This is really pretty cool. LLMs are so bad at images that it just makes sense to use reasoning to improve them. I'd love to see this applied to a bigger model than 3B, because this task is not difficult. But the attention visualization really demonstrates that it's doing what it's supposed to.
What do you mean LLMs are bad at images? GPT or Claude can read text perfectly, and describe what's in a picture in a lot of detail. I feel like replacing OCR is one of the few things you can actually trust them for.
That’s a good observation. For this project, I found that while the base model could “read” the image, it didn’t really understand how to use it. GRPO allowed it to effectively search the solution space.
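If it helps to picture it: GRPO samples a group of completions per prompt and scores each one relative to the rest of its own group. Here's a minimal sketch of that group-relative advantage (illustrative only, not our actual training code; the reward values and group size are made up):

```python
# Minimal sketch of GRPO's group-relative advantage (illustrative only).
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: (group_size,) reward for each sampled completion of one prompt."""
    # Each completion is judged against the model's own samples for the same
    # image, which is what lets it effectively search the solution space.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# e.g. 8 sampled decodings of one image, scored 1.0 for an exact match
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0])
print(grpo_advantages(rewards))  # positive -> reinforced, negative -> discouraged
```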
That's true - they are quite good at OCR. But they're really bad at a bunch of tasks that seem like they should be super simple. Like "are these lines crossed" or "which letter is circled". See https://vlmsareblind.github.io/ for some clear examples.
Thanks! I really love the visualization too. We have a hosted demo you can try as well!
https://huggingface.co/spaces/Groundlight/grpo-vlm-decoder
Fun! I wish the demo had the attention visualization. Would that be easy to add? Is the source code for the HF demo in the repo too?
Unfortunately it might be a bit challenging as there’s a nontrivial amount of extra computation we do for the viz, but it’s probably possible?
The attention demo code is in the /attention_demo directory if you want to try it on your own messages too :)
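If you want a feel for what that extra computation involves, here's a rough sketch of pulling attention maps out of a VLM during generation. The model id, image URL, and prompt are placeholders, and this is not the code in /attention_demo:

```python
# Rough sketch: collect per-token attention weights while a VLM generates.
# Model id, image URL, and prompt below are placeholders, not the repo's setup.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-2B-Instruct"  # placeholder VLM
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="eager",  # needed so attention weights are materialized
)

image = Image.open(requests.get("https://example.com/coded_message.png", stream=True).raw)
conversation = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Decode the message in this image."},
]}]
text = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

out = model.generate(
    **inputs, max_new_tokens=64,
    output_attentions=True, return_dict_in_generate=True,
)
# out.attentions has one entry per generated token, each a tuple of per-layer
# tensors shaped (batch, heads, query_len, key_len). Averaging over heads and
# layers, then mapping the image-token columns back onto patch positions, is
# the extra work behind the heatmap overlay.
```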