Scaling Reasoning in VLMs with Reinforcement Learning

Introduction

With DeepSeek’s release of its R1 model and the buzz it triggered, a clear signal emerged: reinforcement learning (RL) was no longer just for niche applications like robotics. It had re-entered the mainstream. Until then, RL’s role in training large language models (LLMs) was peripheral, often overshadowed by supervised fine-tuning (SFT). But DeepSeek’s R1 model changed the conversation.

Figure: OpenAI’s O1 model outperforming its SOTA supervised fine-tuned baseline, GPT-4o

Before R1, models like OpenAI’s O1 had already shown how reinforcement learning could elevate reasoning abilities beyond what pure SFT could achieve. But even then, one thing remained constant: the modality. These advances were happening in a world of text-only inputs and outputs.

The frontier remained largely untouched: vision-language models (VLMs). Could reinforcement learning improve multimodal reasoning the way it improved LLMs?

This blog dives into exactly that question. We explore whether small-scale RL training can approach or even surpass the performance of SFT in VLMs, especially in structured tasks like counting, geometry, and visual logic.

What This Blog Covers

  • Demonstrating RL’s feasibility on just ~500 samples in a VLM setting
  • Unpacking two emerging RL algorithms—GRPO and DAPO
  • Comparing them against supervised baselines
  • Sharing a lightweight Qwen2-VL model trained using DAPO, to support reproducibility and further experimentation

Verifiable Rewards in Vision-Language Models

We explore the potential of reinforcement learning in VLMs through verifiable reward signals, focusing on tasks where correctness is objectively measurable, such as counting or geometry-related tasks.

Given a training sample $(x_i, y_i^*)$, our policy model $\pi_\theta$ (i.e., the VLM) is prompted with $x_i$ to generate a structured response:

<think>thinking path</think>

<answer>prediction</answer>

Example (illustrative, for a counting question):

<think>The image shows three triangles on the left and two on the right, so there are five triangles in total.</think>

<answer>5</answer>

This format encourages the model to reason step by step before producing an answer, allowing us to evaluate both the correctness of the final output and the integrity of the format.

Reward Structure:

We employ two types of rewards:

  • Accuracy Reward: Based on whether the predicted answer $\hat{y}_i$ matches the ground-truth label $y_i^*$
  • Format Reward: Ensures the model wraps its output in the expected structured format, which is pivotal for downstream parsing and interpretability.
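To make this concrete, here is a minimal Python sketch of the two reward functions. The regular expressions, numeric tolerance, and 1/0 reward values are illustrative assumptions rather than the exact rules used in our runs:

```python
import re


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the <think>...</think><answer>...</answer> template, else 0.0."""
    pattern = r"^\s*<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, completion, re.DOTALL) else 0.0


def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the answer extracted from <answer>...</answer> matches the ground-truth label, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    prediction = match.group(1).strip()
    try:
        # Numeric comparison when both sides parse as numbers (e.g. counting tasks)
        return 1.0 if abs(float(prediction) - float(ground_truth)) < 1e-6 else 0.0
    except ValueError:
        # Otherwise fall back to exact string matching
        return 1.0 if prediction == ground_truth.strip() else 0.0
```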

GRPO: Group Relative Policy Optimization

At its core, GRPO is about comparing outputs within a group and rewarding the best ones. The principal idea is to:

  • Sample multiple candidate responses for the same prompt.
  • Score each output using our reward functions (in our case, accuracy and format correctness).
  • Normalise each reward relative to its group, rather than using it in isolation.

This strategy gives us an advantage estimate for each response:

$\hat{A}_{i,t} = \dfrac{R_i - \text{mean}(\{R_j\}_{j=1}^G)}{\text{std}(\{R_j\}_{j=1}^G) + \epsilon}$

where:

- $R_i$ is the reward for response $o_i$,

- the mean and standard deviation are taken over the group of $G$ responses sampled for the same prompt,

- $\epsilon$ is a small constant for numerical stability.
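As a quick sketch (the helper name and the use of a combined accuracy-plus-format reward per response are assumptions), the normalisation amounts to:

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalise each reward against the mean/std of its group of G sampled responses."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Example: a group of G = 4 responses whose rewards combine accuracy and format
print(group_relative_advantages([2.0, 0.0, 2.0, 1.0]))
```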

This measures how much better (or worse) a sample is compared to its peers. The training objective is then:

$\mathcal{J}_{\text{GRPO}}(\theta) =\mathbb{E}_{\substack{(q, a) \sim \mathcal{D} \\ \{o_i\}_{i=1}^G \sim \pi_{\text{old}}(\cdot \mid q)}}\left[\frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}\left(\min \left(r_{i,t}(\theta)\, \hat{A}_{i,t},\ \text{clip}\left(r_{i,t}(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) \hat{A}_{i,t}\right) - \beta\, D_{\text{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)\right)\right]$

Where:

- $\mathbb{E}_{(q, a) \sim \mathcal{D}}$: Denotes expectation over input samples and the group of G outputs sampled per question. Rather than optimising one response at a time, GRPO computes performance in a relative group context.

- $\min(\cdot)$: This clipping strategy is borrowed from PPO. If the new policy is moving too aggressively, we restrict the update to avoid destabilisation.

- $\beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})$: The Kullback–Leibler divergence measures how far the new policy has drifted from the reference policy $\pi_{\text{ref}}$. It acts as a regulariser, keeping the learning process stable and preventing the model from forgetting what it already knew.

and,

- $r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\text{old}}(o_{i,t} \mid q, o_{i,<t})}$: the per-token importance ratio between the new and old policies. Clipping this ratio discourages the new policy from drifting too far from the old one unless the change is clearly advantageous.
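Putting the pieces together, a minimal PyTorch sketch of this objective for a single group is given below. It assumes per-token log-probabilities under the current, old, and reference policies are already available; the k3-style KL estimator and the default β are assumptions, not a faithful reproduction of any particular library:

```python
import torch


def grpo_loss(new_logprobs, old_logprobs, ref_logprobs, advantages, mask,
              clip_eps: float = 0.2, beta: float = 0.04) -> torch.Tensor:
    """GRPO objective for one group, negated so it can be minimised with gradient descent.

    Shapes: new/old/ref_logprobs and mask are [G, T]; advantages is [G].
    """
    ratio = torch.exp(new_logprobs - old_logprobs)                     # r_{i,t}(theta)
    adv = advantages.unsqueeze(-1)                                      # broadcast A_i over tokens
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    # Per-token KL(pi_theta || pi_ref), using the common k3 estimator
    kl = torch.exp(ref_logprobs - new_logprobs) - (ref_logprobs - new_logprobs) - 1
    per_token = surrogate - beta * kl
    per_seq = (per_token * mask).sum(dim=-1) / mask.sum(dim=-1)         # 1/|o_i| * sum over tokens
    return -per_seq.mean()                                              # 1/G * sum over responses, negated
```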

DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization

DAPO inherits many of GRPO’s strengths but makes two bold changes:

  • It removes the KL penalty, allowing the policy to diverge more freely from its past self.
  • It computes losses at the token level, offering more granular feedback.

The DAPO loss is defined as:

$\mathcal{J}_{\text{DAPO}}(\theta) =\mathbb{E}_{\substack{(q, a) \sim \mathcal{D} \\\{o_i\}_{i=1}^G \sim \pi_{\text{old}}(\cdot | q)}}\left[\frac{1}{\sum_{i=1}^G |o_i|}\sum_{i=1}^G \sum_{t=1}^{|o_i|}\min \left(r_{i,t}(\theta) \hat{A}_{i,t},\,\text{clip}(r_{i,t}(\theta), 1 - \epsilon_{\text{low}}, 1 + \epsilon_{\text{high}})\, \hat{A}_{i,t}\right)\right]$

Subject to the constraint:

$0 < \left| \left\{ o_i \mid \text{is\_equivalent}(a, o_i) \right\} \right| < G$

Where:

- $\text{clip}(r_{i,t}(\theta), 1 - \epsilon_{\text{low}}, 1 + \epsilon_{\text{high}})$: Ensures stable gradients by restricting updates to a trust region; decoupling the bounds allows a larger upward range ($\epsilon_{\text{high}} > \epsilon_{\text{low}}$), which encourages exploration.

- $\frac{1}{\sum_{i=1}^G |o_i|} \sum_{i=1}^G \sum_{t=1}^{|o_i|}$: Enforces token-level updates. Every token has a say in the loss, encouraging local improvements even in imperfect or erroneous outputs.
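Relative to the GRPO sketch above, the token-level loss and decoupled clip bounds can be sketched as follows (again assuming per-token log-probabilities; the default ε values mirror the range used in our training setup):

```python
import torch


def dapo_loss(new_logprobs, old_logprobs, advantages, mask,
              eps_low: float = 0.20, eps_high: float = 0.28) -> torch.Tensor:
    """DAPO objective for one group, negated for gradient descent.

    Differences from the GRPO sketch: no KL term, asymmetric clip bounds,
    and a single average over all tokens in the group rather than per sequence.
    """
    ratio = torch.exp(new_logprobs - old_logprobs)
    adv = advantages.unsqueeze(-1)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * adv)
    return -(surrogate * mask).sum() / mask.sum()
```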

Unlike GRPO, DAPO removes the leash: there is no KL penalty anchoring the new policy to a reference policy. This encourages the model to explore more freely, but only works if the reward signals are strong and sufficiently diverse within the constrained space.
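The dynamic-sampling constraint above can be read as a simple group filter, sketched here under the assumption of ±1 rule-based rewards (as noted later, we did not enforce this filter in our own runs):

```python
def keep_group(rewards: list[float], G: int) -> bool:
    """Keep a group only if its responses are neither all correct nor all wrong (0 < #correct < G)."""
    num_correct = sum(r > 0 for r in rewards)
    return 0 < num_correct < G
```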

Rule-based Reward

DAPO avoids reward hacking by using task verifiability:

$R(\hat{y}, y) = \begin{cases} 1, & \text{if } \text{is\_equivalent}(\hat{y}, y) \\ -1, & \text{otherwise} \end{cases}$
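A sketch of this reward, assuming a simple numeric (or exact-string) comparison stands in for is_equivalent:

```python
def rule_based_reward(prediction: str, ground_truth: str) -> float:
    """+1 if the prediction is equivalent to the ground truth, -1 otherwise."""
    try:
        return 1.0 if abs(float(prediction) - float(ground_truth)) < 1e-6 else -1.0
    except ValueError:
        return 1.0 if prediction.strip() == ground_truth.strip() else -1.0
```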

Training Setup

  • Cold Start: RL is applied directly to the base model without any supervised fine-tuning step.
  • No CoT Supervision: No gold thinking paths are provided
  • Verifiable Rewards: All rewards are derived from rule-based evaluators that check symbolic/numeric correctness and format compliance
  • Prompt/completion length: 512
  • Batch size: 1 per device
  • Gradient accumulation: 2
  • Generations per sample (G): 2
  • Optimiser: AdamW-torch-fused (DAPO)
  • Precision: bfloat16
  • Epsilon Range: [0.20, 0.28] (DAPO)
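For reference, the setup above can be collected into a plain configuration dictionary. The key names are illustrative (loosely following Hugging Face-style trainer arguments) rather than the exact arguments of any specific trainer:

```python
# Hypothetical summary of the hyperparameters listed above
train_config = {
    "max_prompt_length": 512,
    "max_completion_length": 512,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 2,
    "num_generations": 2,           # G: responses sampled per prompt
    "optim": "adamw_torch_fused",   # used for the DAPO runs
    "bf16": True,
    "epsilon_low": 0.20,            # DAPO clip bounds
    "epsilon_high": 0.28,
}
```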

Dataset

The dataset chosen for reinforcement learning was sampled from the training dataset provided by researchers at Peking University. It primarily consists of geometry-based visual question-answering (QA) tasks, involving questions about spatial reasoning and elementary Euclidean geometry. We selected 500 examples to form a minimal, well-controlled benchmark for evaluating how RL-trained VLMs perform on out-of-distribution (OOD) reasoning tasks. The reduced dataset size ensured faster iteration and a clearer lens through which to compare performance against standard supervised fine-tuning (SFT) baselines.

Results

We present below a distilled summary of the most relevant experiments, focusing solely on base (pretrained) models with and without reinforcement learning via GRPO or DAPO.

Note: Red font indicates the experiments performed by authors of R1-V

Discussion

The results above yield several key observations.

Impact of data volume and generations on GRPO and DAPO

RL policy training, notably GRPO in our case, demonstrates clear potential when:

  • the training dataset and the number of generations per sample are scaled meaningfully.

The best-performing configuration (Qwen 2.5 VL trained with GRPO on 8000 rows of training data) reached 47.48% accuracy, decidedly higher than both the no-RL baseline (35.41%) and the small-data GRPO variant (31.5%).

This suggests that RL training scales well with both data and generations - the latter being critical for stabilising the policy and reducing output variance.

Why didn’t RL excel in geometry?

Unlike counting or math-based tasks (which are largely text-based), geometry demands richer image understanding to accurately interpret both the question and the solution it calls for. Hypothetically, these factors may limit RL’s advantage:

  • Low number of generations (G = 2): may lead to limited exploration and unstable optimisation, thereby affecting model convergence.
  • Model expressivity: smaller models may lack the capacity to generalise to complex tasks from sparse feedback.
  • Reward granularity: binary rewards may not capture the nuance required for visual reasoning. It might be prudent to introduce intermediate rewards that emit a signal at every step, reinforcing the chain of thought rather than only the final answer.

This also raises the question: Do we need better encoders or better strategies?

One potential solution could be a hybrid approach:

  • Use SFT to learn structured visual concepts and reasoning patterns,
  • Follow it with RL to generalise under distributional shifts (e.g., varying visual layouts)

DAPO Requires Reinforcement

DAPO underperformed in this setup (13.79%) compared to the no-RL baseline (14.32%). This is likely due to:

  • Absence of dynamic sampling: Due to computational constraints, we did not enforce the dynamic-sampling constraint under which DAPO was originally formulated. Dynamic sampling appears to be pivotal in ensuring that DAPO outperforms predecessors like GRPO.
  • As with GRPO, the small number of generations and the limited number of training rows may have further prevented the policy model from learning useful reasoning paths, leading to instability.

Key Takeaways

Despite the rather modest setup, several encouraging patterns have emerged:

  • Even small-scale GRPO training (500 samples, 2 generations) nudged the model towards better structure-aware output
  • With sufficient data, RL proved capable of substantial gains in structured tasks.
  • The framework we developed generalises well to verifiable-reward pipelines, offering flexibility for scaling to diverse domains (e.g., math, logic, spatial reasoning).
  • We invite readers to explore and build upon our lightweight model - a DAPO-trained Qwen2-VL variant, which serves as a reproducible, minimal baseline for anyone looking to experiment with RL in vision-language tasks.

Final Remarks

This study explored the effectiveness of reinforcement learning in enhancing the reasoning capabilities of vision-language models (VLMs). By leveraging reward-verifiable tasks like geometry and arithmetic reasoning, we benchmarked GRPO and DAPO against standard supervised baselines across various configurations.

Our results demonstrate that:

  • GRPO can deliver substantial improvements, especially when scaled with sufficient data and higher generation counts.
  • DAPO, while theoretically powerful, requires its full dynamic-sampling and reward-shaping strategies to outperform traditional fine-tuning methods.
  • Small-scale RL setups can nudge performance in the right direction within very few steps. However, their full potential is realised when training on larger corpora with careful tuning.

More importantly, the experiments highlight that reinforcement learning in multimodal settings is non-trivial but promising. The space invites hybrid methods that combine supervised learning’s inductive structure with RL’s adaptability - particularly for out-of-distribution (OOD) generalisation.

As VLMs become increasingly important to real-world AI systems and applications, refining how they reason and learn from structured feedback will remain a critical frontier. Our experiments offer a small but promising step in that direction.