Nov 20, 2025
4 min read

GRPO vs DPO: Mathematical Reasoning in Small-Scale LLMs

A comparative study of Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO) for enhancing mathematical reasoning in the Qwen2.5-3B model, achieving 71% accuracy on GSM8K with 33% lower VRAM usage than DPO.

Mathematical reasoning remains one of the most challenging capabilities for large language models, especially in resource-constrained settings. This research presents a comparative study of two optimization methods — Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO) — for improving mathematical reasoning in the Qwen2.5-3B model on the GSM8K dataset. GRPO significantly outperformed DPO, achieving 71% accuracy (vs 58%) while using 33% less VRAM and converging 32% faster.

Key Findings

| Metric | Baseline | GRPO | DPO | GRPO Advantage |
| --- | --- | --- | --- | --- |
| Accuracy | 46% | 71% | 58% | +13 pts |
| Convergence Speed | - | 170 steps | 250 steps | 32% faster |
| VRAM Usage | - | 16 GB | 24 GB | 33% lower |
| Training Time (500 steps) | - | 1.4 hours | 1.5 hours | Similar |

Why This Matters

  • Democratization — Enables powerful math reasoning on consumer hardware
  • Efficiency — Lower memory footprint and faster convergence mean cheaper training runs
  • Accessibility — Makes advanced AI reasoning tools viable for local deployment without large-scale infrastructure

Architecture

GRPO (Group Relative Policy Optimization)

Input Question → Policy Model (Qwen2.5-3B)
        ↓
Generate G Responses per Question
        ↓
Group-wise Advantage Estimation
    R_mean = (1/G) Σ R_i
    A_i = (R_i - R_mean) / σ
        ↓
Update Policy (No Critic!)
        ↓
Optimized Mathematical Reasoning

GRPO’s key innovation is group-wise advantage estimation — it compares responses within a group rather than requiring a separate critic/value network. This eliminates an entire model from memory, cutting VRAM requirements by 33%.
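
The group-relative advantage is simple enough to sketch directly. The snippet below is a minimal illustration of the computation (not TRL's internal implementation): the rewards for the G responses sampled for a single question are normalized by the group mean and standard deviation, so no learned value function is needed.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Normalize per-response rewards within one prompt's group of G samples.

    rewards: shape (G,), scalar reward for each sampled response
    returns: shape (G,), advantage A_i = (R_i - mean(R)) / (std(R) + eps)
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled answers to one GSM8K question, two of them good.
rewards = torch.tensor([2.5, 0.0, 2.0, 0.5])  # correctness + format reward per response
print(group_relative_advantages(rewards))     # positive for good answers, negative for bad ones
```

Each response's advantage then weights a standard policy-gradient update, so only the policy (plus an optional frozen reference for the KL term) has to sit in GPU memory.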

DPO (Direct Preference Optimization)

Input Question → Policy Model
        ↓
Generate Correct & Incorrect Pairs
        ↓
Compute Preference Loss: -log σ(s_chosen - s_rejected)
        ↓
Direct Policy Update
        ↓
Improved Mathematical Reasoning

DPO takes a simpler approach — learning from preference pairs of correct vs incorrect answers — but requires a reference model for KL penalty computation, increasing memory overhead.
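
Concretely, the standard DPO objective scores each pair by the policy-to-reference log-probability ratio, scaled by a temperature β. The function below is a self-contained sketch of that loss (not the internals of TRL's DPOTrainer); the per-sequence log-probabilities are assumed to be computed elsewhere.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1):
    """DPO loss: -log sigmoid(beta * (s_chosen - s_rejected)),
    where s = log pi_theta(y|x) - log pi_ref(y|x)."""
    s_chosen = policy_chosen_logps - ref_chosen_logps
    s_rejected = policy_rejected_logps - ref_rejected_logps
    loss = -F.logsigmoid(beta * (s_chosen - s_rejected)).mean()
    # The "chosen" and "rejected" rewards tracked during training are beta * s;
    # their widening gap is the margin reported in the results below.
    return loss, (beta * s_chosen).detach(), (beta * s_rejected).detach()
```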

Tech Stack

  • Base Model: Qwen2.5-3B-Instruct
  • Fine-tuning: LoRA (r=16) via Unsloth for efficient 4-bit training
  • Training Framework: Hugging Face TRL (GRPOTrainer, DPOTrainer); a setup sketch follows this list
  • Dataset: GSM8K — 8,500 grade-school math word problems (7,500 train / 1,000 test)
  • Hardware: NVIDIA RTX A5000 (24GB VRAM)
  • Languages/Libraries: Python, PyTorch, Transformers
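
For reference, a rough wiring of the GRPO run with TRL and peft might look like the following. This is a sketch under assumptions, not the exact configuration used here: argument names can shift between TRL releases, the Unsloth 4-bit path is omitted, and the reward function is a placeholder (fuller reward components are sketched in the next section).

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# GSM8K from the Hugging Face Hub; GRPOTrainer expects a "prompt" column.
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {"prompt": x["question"]})

def correctness_reward(completions, answer, **kwargs):
    # Placeholder: +2.0 if the gold final answer (after "####") appears in the completion.
    return [2.0 if a.split("####")[-1].strip() in c else 0.0
            for c, a in zip(completions, answer)]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    reward_funcs=[correctness_reward],
    args=GRPOConfig(output_dir="grpo-gsm8k", num_generations=8, max_steps=500),
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),  # alpha is illustrative
)
trainer.train()
```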

Reward Design (GRPO)

A multi-component reward system drives GRPO’s learning:

  1. Correctness Reward (primary) — Exact match against ground truth answer (+2.0)
  2. Format Compliance (structural) — Checks for proper <reasoning> and <answer> XML tags (+0.5)
  3. Integer Validation (type checking) — Verifies the answer is a valid integer
  4. XML Count (structure quality) — Weighted score for tag balance and count

This multi-signal approach encourages the model to produce both correct answers and well-structured reasoning chains.
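
As an illustration, the first three reward components could be implemented roughly as below. The tag-parsing helper and the +0.5 bonus for the integer check are assumptions made for the sketch; only the +2.0 correctness and +0.5 format values come from the list above.

```python
import re

def extract_answer(completion: str) -> str | None:
    """Return the text inside <answer>...</answer>, if present."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
    return match.group(1) if match else None

def correctness_reward(completions, answers):
    """+2.0 for an exact match against the ground-truth answer, else 0.0."""
    return [2.0 if extract_answer(c) == a else 0.0 for c, a in zip(completions, answers)]

def format_reward(completions):
    """+0.5 if the response follows the <reasoning>/<answer> structure."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return [0.5 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]

def integer_reward(completions):
    """Bonus when the extracted answer parses as an integer (bonus value illustrative)."""
    rewards = []
    for c in completions:
        ans = extract_answer(c)
        rewards.append(0.5 if ans is not None and ans.lstrip("-").isdigit() else 0.0)
    return rewards
```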

Results

Accuracy Progression

  • GRPO: 46% → 71% (plateaus at ~170 steps)
  • DPO: 46% → 58% (plateaus at ~250 steps)

Reward Evolution (GRPO)

  • Correctness Reward: -1.0 → 1.65 (max 2.0)
  • Format Reward: -0.5 → 0.5 (max 0.5)
  • Combined Reward: Steady increase indicating improved reasoning quality

DPO Preference Learning

The gap between chosen and rejected rewards widened from ~0 to ~4.5 over training, indicating successful discrimination between correct and incorrect solutions.

When to Use Each Method

Use GRPO when:

  • Limited GPU memory is available
  • Faster convergence is needed
  • You can design effective reward functions
  • Training stability is critical

Use DPO when:

  • Preference data is readily available
  • Simpler implementation is preferred
  • Sufficient computational resources exist
  • Reward function design is challenging

Conclusion

GRPO proves to be the superior optimization method for mathematical reasoning in small-scale LLMs, delivering higher accuracy with lower resource requirements. By eliminating the need for a separate value network through group-wise advantage estimation, GRPO makes fine-tuning for mathematical reasoning accessible on consumer-grade GPUs. Future directions include extending to other benchmarks (MATH, MathQA), testing across model sizes (1B–13B), and exploring hybrid GRPO+DPO approaches.