Mathematical reasoning remains one of the most challenging capabilities for large language models, especially in resource-constrained settings. This research presents a comparative study of two optimization methods, Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO), for improving mathematical reasoning in the Qwen2.5-3B model on the GSM8K dataset. GRPO significantly outperformed DPO, reaching 71% accuracy versus 58% for DPO, while using 33% less VRAM and converging 32% faster.
Key Findings
| Metric | Baseline | GRPO | DPO | GRPO Advantage (vs DPO) |
|---|---|---|---|---|
| Accuracy | 46% | 71% | 58% | +13 pts |
| Convergence Speed | - | 170 steps | 250 steps | 32% faster |
| VRAM Usage | - | 16 GB | 24 GB | 33% lower |
| Training Time (500 steps) | - | 1.4 hours | 1.5 hours | Similar |
Why This Matters
- Democratization — Enables powerful math reasoning on consumer hardware
- Efficiency — Lower memory footprint and faster convergence mean cheaper training runs
- Accessibility — Makes advanced AI reasoning tools viable for local deployment without large-scale infrastructure
Architecture
GRPO (Group Relative Policy Optimization)
Input Question → Policy Model (Qwen2.5-3B)
↓
Generate N Responses
↓
Group-wise Advantage Estimation
R̄ = (1/G) Σ R_i
A_i = (R_i - R̄) / σ_R
↓
Update Policy (No Critic!)
↓
Optimized Mathematical Reasoning
GRPO’s key innovation is group-wise advantage estimation — it compares responses within a group rather than requiring a separate critic/value network. This eliminates an entire model from memory, cutting VRAM requirements by 33%.
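As a minimal sketch of the idea (not the exact TRL internals), the group-relative advantage for one prompt's sampled responses can be computed as follows:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """GRPO-style advantages for one prompt.

    rewards: shape (G,), one scalar reward per sampled response in the group.
    Each response is scored relative to its own group's mean and std,
    so no separate value/critic network is needed.
    """
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + eps)  # eps guards against zero-variance groups

# Example: 4 sampled answers to one question, scored by the combined reward
rewards = torch.tensor([2.5, -1.0, 2.0, -0.5])
print(group_relative_advantages(rewards))  # positive for above-average answers
```

Responses that beat their group average receive positive advantages and are reinforced; worse-than-average ones are pushed down, which is all the policy update needs.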
DPO (Direct Preference Optimization)
Input Question → Policy Model
↓
Generate Correct & Incorrect Pairs
↓
Compute Preference Loss: -log σ(β(s_chosen - s_rejected)),
where s = log π_θ(y|x) - log π_ref(y|x)
↓
Direct Policy Update
↓
Improved Mathematical Reasoning
DPO takes a simpler approach — learning from preference pairs of correct vs incorrect answers — but requires a reference model for KL penalty computation, increasing memory overhead.
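For reference, a minimal sketch of the standard DPO objective, taking sequence-level log-probabilities as inputs (the variable names here are illustrative, not the TRL DPOTrainer API):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    The frozen reference model supplies the ref_* log-probs, which is the
    extra memory overhead mentioned above.
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps        # log pi_theta/pi_ref for y_chosen
    rejected_ratio = policy_rejected_logps - ref_rejected_logps  # log pi_theta/pi_ref for y_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy batch of two preference pairs (the log-probs are made up for illustration)
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-15.0, -11.0]),
                torch.tensor([-12.5, -10.0]), torch.tensor([-14.0, -10.5]))
print(loss)
```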
Tech Stack
- Base Model: Qwen2.5-3B-Instruct
- Fine-tuning: LoRA (r=16) via Unsloth for efficient 4-bit training
- Training Framework: Hugging Face TRL (GRPOTrainer, DPOTrainer); a setup sketch follows this list
- Dataset: GSM8K — ~8.5K grade-school math word problems (7,473 train / 1,319 test)
- Hardware: NVIDIA RTX A5000 (24GB VRAM)
- Languages/Libraries: Python, PyTorch, Transformers
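A rough sketch of how these pieces might be wired together for the GRPO run. Argument names and defaults vary across Unsloth and TRL releases, and `correctness_reward` refers to the reward sketch under Reward Design below, so treat this as an outline rather than the exact training script:

```python
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# 4-bit base model with a rank-16 LoRA adapter
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen2.5-3B-Instruct", max_seq_length=1024, load_in_4bit=True)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"])

# GSM8K answers end with "#### <number>"; keep only that number as ground truth.
# (A system prompt asking for <reasoning>/<answer> tags would normally be prepended.)
gsm8k = load_dataset("openai/gsm8k", "main")
train = gsm8k["train"].map(lambda ex: {
    "prompt": ex["question"],
    "answer": ex["answer"].split("####")[-1].strip(),
})

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[correctness_reward],  # plus format/integer/XML rewards in practice
    args=GRPOConfig(output_dir="outputs", num_generations=8,
                    max_completion_length=512, max_steps=500),
    train_dataset=train,
)
trainer.train()
```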
Reward Design (GRPO)
A multi-component reward system drives GRPO’s learning:
- Correctness Reward (primary) — Exact match against ground truth answer (+2.0)
- Format Compliance (structural) — Checks for proper `<reasoning>` and `<answer>` XML tags (+0.5)
- Integer Validation (type checking) — Verifies the answer is a valid integer
- XML Count (structure quality) — Weighted score for tag balance and count
This multi-signal approach encourages the model to produce both correct answers and well-structured reasoning chains.
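A minimal sketch of what the first three rewards might look like, assuming TRL's reward-function convention (each function receives the batch of completions, here treated as plain strings, plus dataset columns such as `answer`, and returns one score per completion). The parsing regexes and the integer-bonus value are illustrative:

```python
import re

def extract_answer(text: str) -> str:
    """Return the contents of the <answer> tag, or an empty string if absent."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def correctness_reward(completions, answer, **kwargs):
    """+2.0 when the extracted answer exactly matches the ground truth."""
    return [2.0 if extract_answer(c) == gt else 0.0 for c, gt in zip(completions, answer)]

def format_reward(completions, **kwargs):
    """+0.5 when the completion follows the <reasoning>...<answer>... structure."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return [0.5 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]

def integer_reward(completions, **kwargs):
    """Small bonus (value illustrative) when the extracted answer parses as an integer."""
    return [0.5 if extract_answer(c).lstrip("-").isdigit() else 0.0 for c in completions]
```

The per-function scores are combined into a single reward per completion, which then feeds the group-relative advantage estimation shown earlier.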
Results
Accuracy Progression
- GRPO: 46% → 71% (plateaus at ~170 steps)
- DPO: 46% → 58% (plateaus at ~250 steps)
Reward Evolution (GRPO)
- Correctness Reward: -1.0 → 1.65 (max 2.0)
- Format Reward: -0.5 → 0.5 (max 0.5)
- Combined Reward: Steady increase indicating improved reasoning quality
DPO Preference Learning
The gap between chosen and rejected rewards widened from ~0 to ~4.5 over training, indicating successful discrimination between correct and incorrect solutions.
When to Use Each Method
Use GRPO when:
- Limited GPU memory is available
- Faster convergence is needed
- You can design effective reward functions
- Training stability is critical
Use DPO when:
- Preference data is readily available
- Simpler implementation is preferred
- Sufficient computational resources exist
- Reward function design is challenging
Conclusion
In this setting, GRPO proved the superior optimization method for mathematical reasoning in small-scale LLMs, delivering higher accuracy with lower resource requirements. By eliminating the need for a separate value network through group-wise advantage estimation, GRPO makes fine-tuning for mathematical reasoning accessible on consumer-grade GPUs. Future directions include extending to other benchmarks (MATH, MathQA), testing across model sizes (1B–13B), and exploring hybrid GRPO+DPO approaches.