Nov 20, 2025
4 min read

GRPO vs DPO: Mathematical Reasoning in Small-Scale LLMs

A comparative study of Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO) for enhancing mathematical reasoning in the Qwen2.5-3B model, achieving 71% accuracy on GSM8K with 33% lower VRAM usage than DPO.

Mathematical reasoning remains one of the most challenging capabilities for large language models, especially in resource-constrained settings. This research presents a comparative study of two optimization methods — Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO) — for improving mathematical reasoning in the Qwen2.5-3B model on the GSM8K dataset. GRPO significantly outperformed DPO, achieving 71% accuracy (vs 58%) while using 33% less VRAM and converging 32% faster.

Key Findings

| Metric | Baseline | GRPO | DPO | GRPO Advantage |
| --- | --- | --- | --- | --- |
| Accuracy | 46% | 71% | 58% | +13 pts |
| Convergence Speed | - | 170 steps | 250 steps | 32% faster |
| VRAM Usage | - | 16 GB | 24 GB | 33% lower |
| Training Time (500 steps) | - | 1.4 hours | 1.5 hours | Similar |

Why This Matters

  • Democratization — Enables powerful math reasoning on consumer hardware
  • Efficiency — Lower memory footprint and faster convergence mean cheaper training runs
  • Accessibility — Makes advanced AI reasoning tools viable for local deployment without large-scale infrastructure

Architecture

GRPO (Group Relative Policy Optimization)

Input Question → Policy Model (Qwen2.5-3B)
        ↓
Generate G Responses per Question
        ↓
Group-wise Advantage Estimation
    R_mean = (1/G) Σ R_i
    A_i = (R_i - R_mean) / σ
        ↓
Update Policy (No Critic!)
        ↓
Optimized Mathematical Reasoning

GRPO’s key innovation is group-wise advantage estimation — it compares responses within a group rather than requiring a separate critic/value network. This eliminates an entire model from memory, cutting VRAM requirements by 33%.
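
The group-relative advantage is simple enough to sketch directly. The snippet below is a minimal illustration of the computation (not TRL's internal implementation): the rewards for the G responses sampled for a single question are normalized by the group mean and standard deviation, so no learned value function is needed.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Normalize per-response rewards within one prompt's group of G samples.

    rewards: shape (G,), scalar reward for each sampled response
    returns: shape (G,), advantage A_i = (R_i - mean(R)) / (std(R) + eps)
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled answers to one GSM8K question, two of them good.
rewards = torch.tensor([2.5, 0.0, 2.0, 0.5])  # correctness + format reward per response
print(group_relative_advantages(rewards))     # positive for good answers, negative for bad ones
```

Each response's advantage then weights a standard policy-gradient update, so only the policy (plus an optional frozen reference for the KL term) has to sit in GPU memory.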

DPO (Direct Preference Optimization)

Input Question → Policy Model
        ↓
Generate Correct & Incorrect Pairs
        ↓
Compute Preference Loss: -log σ(s_chosen - s_rejected)
        ↓
Direct Policy Update
        ↓
Improved Mathematical Reasoning

DPO takes a simpler approach — learning from preference pairs of correct vs incorrect answers — but requires a reference model for KL penalty computation, increasing memory overhead.
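
Concretely, the standard DPO objective scores each pair by the policy-to-reference log-probability ratio, scaled by a temperature β. The function below is a self-contained sketch of that loss (not the internals of TRL's DPOTrainer); the per-sequence log-probabilities are assumed to be computed elsewhere.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1):
    """DPO loss: -log sigmoid(beta * (s_chosen - s_rejected)),
    where s = log pi_theta(y|x) - log pi_ref(y|x)."""
    s_chosen = policy_chosen_logps - ref_chosen_logps
    s_rejected = policy_rejected_logps - ref_rejected_logps
    loss = -F.logsigmoid(beta * (s_chosen - s_rejected)).mean()
    # The "chosen" and "rejected" rewards tracked during training are beta * s;
    # their widening gap is the margin reported in the results below.
    return loss, (beta * s_chosen).detach(), (beta * s_rejected).detach()
```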

Tech Stack

  • Base Model: Qwen2.5-3B-Instruct
  • Fine-tuning: LoRA (r=16) via Unsloth for efficient 4-bit training
  • Training Framework: Hugging Face TRL (GRPOTrainer, DPOTrainer); a setup sketch follows this list
  • Dataset: GSM8K — 8,500 grade-school math word problems (7,500 train / 1,000 test)
  • Hardware: NVIDIA RTX A5000 (24GB VRAM)
  • Languages/Libraries: Python, PyTorch, Transformers
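
For reference, a rough wiring of the GRPO run with TRL and peft might look like the following. This is a sketch under assumptions, not the exact configuration used here: argument names can shift between TRL releases, the Unsloth 4-bit path is omitted, and the reward function is a placeholder (fuller reward components are sketched in the next section).

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# GSM8K from the Hugging Face Hub; GRPOTrainer expects a "prompt" column.
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {"prompt": x["question"]})

def correctness_reward(completions, answer, **kwargs):
    # Placeholder: +2.0 if the gold final answer (after "####") appears in the completion.
    return [2.0 if a.split("####")[-1].strip() in c else 0.0
            for c, a in zip(completions, answer)]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    reward_funcs=[correctness_reward],
    args=GRPOConfig(output_dir="grpo-gsm8k", num_generations=8, max_steps=500),
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),  # alpha is illustrative
)
trainer.train()
```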

Reward Design (GRPO)

A multi-component reward system drives GRPO’s learning:

  1. Correctness Reward (primary) — Exact match against ground truth answer (+2.0)
  2. Format Compliance (structural) — Checks for proper <reasoning> and <answer> XML tags (+0.5)
  3. Integer Validation (type checking) — Verifies the answer is a valid integer
  4. XML Count (structure quality) — Weighted score for tag balance and count

This multi-signal approach encourages the model to produce both correct answers and well-structured reasoning chains.
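
As an illustration, the first three reward components could be implemented roughly as below. The tag-parsing helper and the +0.5 bonus for the integer check are assumptions made for the sketch; only the +2.0 correctness and +0.5 format values come from the list above.

```python
import re

def extract_answer(completion: str) -> str | None:
    """Return the text inside <answer>...</answer>, if present."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
    return match.group(1) if match else None

def correctness_reward(completions, answers):
    """+2.0 for an exact match against the ground-truth answer, else 0.0."""
    return [2.0 if extract_answer(c) == a else 0.0 for c, a in zip(completions, answers)]

def format_reward(completions):
    """+0.5 if the response follows the <reasoning>/<answer> structure."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return [0.5 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]

def integer_reward(completions):
    """Bonus when the extracted answer parses as an integer (bonus value illustrative)."""
    rewards = []
    for c in completions:
        ans = extract_answer(c)
        rewards.append(0.5 if ans is not None and ans.lstrip("-").isdigit() else 0.0)
    return rewards
```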

Results

Accuracy Progression

  • GRPO: 46% → 71% (plateaus at ~170 steps)
  • DPO: 46% → 58% (plateaus at ~250 steps)

Reward Evolution (GRPO)

  • Correctness Reward: -1.0 → 1.65 (max 2.0)
  • Format Reward: -0.5 → 0.5 (max 0.5)
  • Combined Reward: Steady increase indicating improved reasoning quality

DPO Preference Learning

The gap between chosen and rejected rewards widened from ~0 to ~4.5 over training, indicating successful discrimination between correct and incorrect solutions.

When to Use Each Method

Use GRPO when:

  • Limited GPU memory is available
  • Faster convergence is needed
  • You can design effective reward functions
  • Training stability is critical

Use DPO when:

  • Preference data is readily available
  • Simpler implementation is preferred
  • Sufficient computational resources exist
  • Reward function design is challenging

Conclusion

GRPO proves to be the superior optimization method for mathematical reasoning in small-scale LLMs, delivering higher accuracy with lower resource requirements. By eliminating the need for a separate value network through group-wise advantage estimation, GRPO makes fine-tuning for mathematical reasoning accessible on consumer-grade GPUs. Future directions include extending to other benchmarks (MATH, MathQA), testing across model sizes (1B–13B), and exploring hybrid GRPO+DPO approaches.