Large Language Models struggle with precision and reasoning in medical contexts, where accuracy can be life-critical. This project develops a 3B-parameter reasoning model for medical applications using Group Relative Policy Optimization (GRPO) combined with Chain-of-Thought (CoT) prompting. The system achieves a +11% accuracy improvement over the supervised fine-tuning baseline and a +8.9% average accuracy gain across three medical QA datasets, all while training efficiently on a single consumer GPU.
## Key Achievements
| Metric | Value |
|---|---|
| Accuracy improvement over SFT | +11% |
| Average accuracy gain (3 datasets) | +8.9% |
| Average perplexity reduction | 4.6 |
| Hardware requirement | Single T4/A5000 GPU |
| Trainable parameters | 5–10% of base model (via LoRA) |
## Architecture

```
┌─────────────────────────────────────────────────────┐
│ Qwen2.5-3B Base Model │
│ (3B Parameters) │
└─────────────────────┬───────────────────────────────┘
│
▼
┌────────────────────────┐
│ LoRA 4-bit Adapters │
│ (5-10% of params) │
└────────┬───────────────┘
│
┌────────▼────────┐
│ SFT Training │
│ (Baseline) │
└────────┬────────┘
│
┌────────▼─────────────────────────┐
│ GRPO Fine-tuning │
│ ┌──────────────────────────┐ │
│ │ Reward Functions: │ │
│ │ • Semantic Similarity │ │
│ │ • Format Compliance │ │
│ │ • Answer Matching │ │
│ │ • XML Structure Count │ │
│ └──────────────────────────┘ │
└──────────────────────────────────┘
│
▼
┌───────────────────────────┐
│ Reasoning + Final Answer │
│ (XML Format) │
└───────────────────────────┘
```
The two-stage pipeline first establishes a baseline with Supervised Fine-Tuning (SFT) on clinical reasoning chains, then applies GRPO to optimize the model’s reasoning quality through multi-signal reward functions — no separate critic network required.
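A minimal sketch of how the two stages could be wired together with Hugging Face TRL is shown below. It assumes a recent TRL release; the hyperparameters, variable names, and dataset layout are illustrative rather than the project's exact configuration (the reward functions are sketched in the Reward Design section):

```python
# Illustrative two-stage pipeline: SFT baseline first, then GRPO on top.
# Assumes a LoRA-wrapped `model` and `tokenizer` (see the Tech Stack sketch)
# and a `train_ds` prepared as in the Datasets section below.
from trl import SFTConfig, SFTTrainer, GRPOConfig, GRPOTrainer

# Stage 1: supervised fine-tuning on clinical reasoning chains
# (consumes the conversational "messages" column).
sft_trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="sft-baseline"),
    train_dataset=train_ds,
    processing_class=tokenizer,
)
sft_trainer.train()

# Stage 2: GRPO fine-tuning with multi-signal rewards and no critic network
# (consumes the "prompt" column; "answer"/"reference" reach the reward funcs).
grpo_trainer = GRPOTrainer(
    model=sft_trainer.model,
    reward_funcs=[semantic_reward, answer_reward, format_reward, xml_count_reward],
    args=GRPOConfig(
        output_dir="grpo-medical",
        num_generations=4,                        # completions sampled per prompt (group size)
        max_completion_length=512,
        learning_rate=5e-6,
        reward_weights=[0.42, 0.29, 0.15, 0.15],  # mix from the Reward Design section
    ),
    train_dataset=train_ds,
    processing_class=tokenizer,
)
grpo_trainer.train()
```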
## Datasets
The model was trained and evaluated across three medical QA datasets:
| Dataset | Size | Description |
|---|---|---|
| medical-o1-reasoning-SFT | 90,120 samples | Clinical questions with long reasoning chains (primary training) |
| BigBio-Med-QA | Varied | Wide range of medical topics (evaluation) |
| PubMedQA | — | Evidence-based biomedical research questions (evaluation) |
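As a rough illustration, the primary dataset can be mapped into records carrying the XML target format used throughout the pipeline. The Hub path `FreedomIntelligence/medical-o1-reasoning-SFT`, the `en` configuration, and the column names `Question`, `Complex_CoT`, and `Response` are assumptions based on the public release:

```python
# Illustrative preparation of the primary training set; column names and the
# Hub path are assumptions and may need adjusting.
from datasets import load_dataset

SYSTEM_PROMPT = (
    "Answer the medical question. Put your step-by-step reasoning inside "
    "<reasoning>...</reasoning> and your final answer inside <answer>...</answer>."
)

def to_xml_sample(row):
    target = (
        f"<reasoning>{row['Complex_CoT']}</reasoning>\n"
        f"<answer>{row['Response']}</answer>"
    )
    prompt = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": row["Question"]},
    ]
    return {
        "messages": prompt + [{"role": "assistant", "content": target}],  # SFT stage
        "prompt": prompt,                   # GRPO stage
        "answer": row["Response"],          # used by the answer-matching reward
        "reference": row["Complex_CoT"],    # used by the semantic-similarity reward
    }

train_ds = load_dataset(
    "FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train"
).map(to_xml_sample)
```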
## Reward Design
GRPO training combines four complementary reward signals in a weighted sum:
- Semantic Similarity (42% weight) — Sentence Transformer (all-MiniLM-L6-v2) measures cosine similarity between generated and reference Chain-of-Thought reasoning
- Answer Matching (29% weight) — Direct comparison of final answer with ground truth
- Format Compliance (15% weight) — Verifies the presence of `<reasoning>` and `<answer>` XML tags
- XML Structure Count (15% weight) — Weighted score for tag balance and structural completeness
This multi-signal approach teaches the model to produce both correct answers and transparent, well-structured reasoning chains — critical for medical applications where interpretability matters.
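The sketch below shows one way these four signals could be written as TRL-style reward functions, each receiving the sampled completions (plus dataset columns via keyword arguments) and returning one score per completion. Helper names and scoring details are illustrative, not the project's exact implementation:

```python
# Illustrative reward functions for the four signals above; weighting is
# left to the trainer rather than hard-coded here.
import re
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def _text(completion):
    # Handles both plain-string and chat-format completions.
    return completion if isinstance(completion, str) else completion[0]["content"]

def _block(tag, text):
    # Extracts the content of an XML block such as <answer>...</answer>.
    m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return m.group(1).strip() if m else ""

def semantic_reward(completions, reference, **kwargs):
    # Cosine similarity between generated and reference reasoning chains.
    gen = [_block("reasoning", _text(c)) or _text(c) for c in completions]
    gen_emb = embedder.encode(gen, convert_to_tensor=True)
    ref_emb = embedder.encode(list(reference), convert_to_tensor=True)
    return [float(util.cos_sim(g, r)) for g, r in zip(gen_emb, ref_emb)]

def answer_reward(completions, answer, **kwargs):
    # 1.0 when the extracted <answer> block matches the ground truth.
    return [
        1.0 if _block("answer", _text(c)).lower() == a.strip().lower() else 0.0
        for c, a in zip(completions, answer)
    ]

def format_reward(completions, **kwargs):
    # Checks that a <reasoning> block is followed by an <answer> block.
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return [1.0 if re.search(pattern, _text(c), re.DOTALL) else 0.0
            for c in completions]

def xml_count_reward(completions, **kwargs):
    # Partial credit for each expected tag, rewarding structural completeness.
    tags = ["<reasoning>", "</reasoning>", "<answer>", "</answer>"]
    return [sum(0.25 for t in tags if t in _text(c)) for c in completions]
```

The scores are left unweighted here; the 0.42 / 0.29 / 0.15 / 0.15 mix is applied by the trainer, for example via `reward_weights` in `GRPOConfig` as in the pipeline sketch above.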
## Results
### Performance Across Datasets
| Dataset | SFT Baseline | SFT + GRPO | Improvement |
|---|---|---|---|
| Base Test | 56.0% | 70.0% | +14.0% |
| BigBio-Med-QA | 52.0% | 56.4% | +4.4% |
| PubMedQA | 47.0% | 56.2% | +9.2% |
### Comparison with Other Approaches
| Approach | Test Accuracy |
|---|---|
| Zero-shot CoT | 35% |
| Few-shot CoT (5 examples) | 42% |
| SFT Baseline | 56% |
| SFT + GRPO (Ours) | 67% |
## Evaluation Methods
- LLM-as-Judge — Gemini 2.0 Flash evaluates logical reasoning and medical correctness
- Perplexity — Measures model confidence and fluency (average reduction of 4.6; a minimal sketch follows this list)
- Human Evaluation — Manual assessment of answer correctness and reasoning clarity
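Perplexity here is the exponential of the mean token-level negative log-likelihood on held-out text. A minimal sketch, assuming `model` and `tokenizer` are already loaded:

```python
# Minimal perplexity sketch: exp(mean negative log-likelihood). Batching and
# sliding-window evaluation are omitted for brevity.
import torch

@torch.no_grad()
def perplexity(text, model, tokenizer, max_length=512):
    enc = tokenizer(text, return_tensors="pt", truncation=True,
                    max_length=max_length).to(model.device)
    # With labels == input_ids the model returns the mean cross-entropy loss.
    loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()
```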
## Tech Stack
- Base Model: Qwen2.5-3B-Instruct (3.09B parameters)
- Fine-tuning: LoRA adapters on a 4-bit quantized base via Unsloth (r=16, targeting q/k/v/o projections); see the loading sketch after this list
- Training Framework: Hugging Face TRL (SFTTrainer → GRPOTrainer pipeline)
- Semantic Similarity: Sentence Transformers (all-MiniLM-L6-v2)
- Hardware: Single NVIDIA T4 or A5000 GPU
- Context Window: 512 tokens
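A loading sketch consistent with the stack above; everything beyond the 4-bit base model, `r=16`, the q/k/v/o target modules, and the 512-token context is an assumption:

```python
# Illustrative setup: Qwen2.5-3B-Instruct loaded in 4-bit with LoRA rank 16
# on the attention projections and a 512-token context.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=512,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,          # assumed; not specified in the write-up
    lora_dropout=0.0,       # assumed
    use_gradient_checkpointing="unsloth",
)
```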
## Why This Matters
- Transparency — Every medical recommendation comes with step-by-step reasoning in structured XML, making outputs interpretable and auditable
- Efficiency — Trains on a single consumer GPU with only 5–10% of parameters updated via LoRA
- Generalization — Consistent improvements across three different medical QA benchmarks, not just the training distribution
- Safety — Chain-of-Thought format forces the model to show its work, making errors easier to catch before they reach patients
## Conclusion
This project demonstrates that GRPO with multi-signal rewards can significantly enhance medical reasoning in small-scale LLMs. The two-stage SFT → GRPO pipeline achieves a 67% test accuracy — nearly double the zero-shot baseline — while remaining trainable on consumer hardware. Future directions include extending to multimodal inputs (medical images, lab reports), implementing adaptive reward weighting, and developing clinical decision-support interfaces.