Large Language Models struggle with precision and reasoning in medical contexts, where accuracy can be life-critical. This project develops a 3B-parameter reasoning model for medical applications using Group Relative Policy Optimization (GRPO) combined with Chain-of-Thought (CoT) prompting. The system achieves a +11% accuracy improvement over the supervised fine-tuning baseline and a +8.9% average accuracy gain across three medical QA datasets, all while training efficiently on a single consumer GPU.
## Key Achievements
| Metric | Value |
|---|---|
| Accuracy improvement over SFT | +11% |
| Average accuracy gain (3 datasets) | +8.9% |
| Average perplexity reduction | 4.6 |
| Hardware requirement | Single T4/A5000 GPU |
| Trainable parameters | 5–10% of base model (via LoRA) |
## Architecture

```
┌─────────────────────────────────────────────────────┐
│ Qwen2.5-3B Base Model │
│ (3B Parameters) │
└─────────────────────┬───────────────────────────────┘
│
▼
┌────────────────────────┐
│ LoRA 4-bit Adapters │
│ (5-10% of params) │
└────────┬───────────────┘
│
┌────────▼────────┐
│ SFT Training │
│ (Baseline) │
└────────┬────────┘
│
┌────────▼─────────────────────────┐
│ GRPO Fine-tuning │
│ ┌──────────────────────────┐ │
│ │ Reward Functions: │ │
│ │ • Semantic Similarity │ │
│ │ • Format Compliance │ │
│ │ • Answer Matching │ │
│ │ • XML Structure Count │ │
│ └──────────────────────────┘ │
└──────────────────────────────────┘
│
▼
┌───────────────────────────┐
│ Reasoning + Final Answer │
│ (XML Format) │
└───────────────────────────┘
```
The two-stage pipeline first establishes a baseline with Supervised Fine-Tuning (SFT) on clinical reasoning chains, then applies GRPO to optimize the model’s reasoning quality through multi-signal reward functions — no separate critic network required.
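A minimal sketch of how the two stages could be wired together with Hugging Face TRL is shown below. It assumes a recent TRL release; the hyperparameters, variable names, and dataset layout are illustrative rather than the project's exact configuration (the reward functions are sketched in the Reward Design section):

```python
# Illustrative two-stage pipeline: SFT baseline first, then GRPO on top.
# Assumes a LoRA-wrapped `model` and `tokenizer` (see the Tech Stack sketch)
# and a `train_ds` prepared as in the Datasets section below.
from trl import SFTConfig, SFTTrainer, GRPOConfig, GRPOTrainer

# Stage 1: supervised fine-tuning on clinical reasoning chains
# (consumes the conversational "messages" column).
sft_trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="sft-baseline"),
    train_dataset=train_ds,
    processing_class=tokenizer,
)
sft_trainer.train()

# Stage 2: GRPO fine-tuning with multi-signal rewards and no critic network
# (consumes the "prompt" column; "answer"/"reference" reach the reward funcs).
grpo_trainer = GRPOTrainer(
    model=sft_trainer.model,
    reward_funcs=[semantic_reward, answer_reward, format_reward, xml_count_reward],
    args=GRPOConfig(
        output_dir="grpo-medical",
        num_generations=4,                        # completions sampled per prompt (group size)
        max_completion_length=512,
        learning_rate=5e-6,
        reward_weights=[0.42, 0.29, 0.15, 0.15],  # mix from the Reward Design section
    ),
    train_dataset=train_ds,
    processing_class=tokenizer,
)
grpo_trainer.train()
```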
## Datasets
The model was trained and evaluated across three medical QA datasets:
| Dataset | Size | Description |
|---|---|---|
| medical-o1-reasoning-SFT | 90,120 samples | Clinical questions with long reasoning chains (primary training) |
| BigBio-Med-QA | Varied | Wide range of medical topics (evaluation) |
| PubMedQA | — | Evidence-based biomedical research questions (evaluation) |
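As a rough illustration, the primary dataset can be mapped into records carrying the XML target format used throughout the pipeline. The Hub path `FreedomIntelligence/medical-o1-reasoning-SFT`, the `en` configuration, and the column names `Question`, `Complex_CoT`, and `Response` are assumptions based on the public release:

```python
# Illustrative preparation of the primary training set; column names and the
# Hub path are assumptions and may need adjusting.
from datasets import load_dataset

SYSTEM_PROMPT = (
    "Answer the medical question. Put your step-by-step reasoning inside "
    "<reasoning>...</reasoning> and your final answer inside <answer>...</answer>."
)

def to_xml_sample(row):
    target = (
        f"<reasoning>{row['Complex_CoT']}</reasoning>\n"
        f"<answer>{row['Response']}</answer>"
    )
    prompt = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": row["Question"]},
    ]
    return {
        "messages": prompt + [{"role": "assistant", "content": target}],  # SFT stage
        "prompt": prompt,                   # GRPO stage
        "answer": row["Response"],          # used by the answer-matching reward
        "reference": row["Complex_CoT"],    # used by the semantic-similarity reward
    }

train_ds = load_dataset(
    "FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train"
).map(to_xml_sample)
```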
## Reward Design
GRPO training combines four complementary reward signals in a weighted sum:
- Semantic Similarity (42% weight) — Sentence Transformer (all-MiniLM-L6-v2) measures cosine similarity between generated and reference Chain-of-Thought reasoning
- Answer Matching (29% weight) — Direct comparison of final answer with ground truth
- Format Compliance (15% weight) — Verifies the presence of `<reasoning>` and `<answer>` XML tags
- XML Structure Count (15% weight) — Weighted score for tag balance and structural completeness
This multi-signal approach teaches the model to produce both correct answers and transparent, well-structured reasoning chains — critical for medical applications where interpretability matters.
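The sketch below shows one way these four signals could be written as TRL-style reward functions, each receiving the sampled completions (plus dataset columns via keyword arguments) and returning one score per completion. Helper names and scoring details are illustrative, not the project's exact implementation:

```python
# Illustrative reward functions for the four signals above; weighting is
# left to the trainer rather than hard-coded here.
import re
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def _text(completion):
    # Handles both plain-string and chat-format completions.
    return completion if isinstance(completion, str) else completion[0]["content"]

def _block(tag, text):
    # Extracts the content of an XML block such as <answer>...</answer>.
    m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return m.group(1).strip() if m else ""

def semantic_reward(completions, reference, **kwargs):
    # Cosine similarity between generated and reference reasoning chains.
    gen = [_block("reasoning", _text(c)) or _text(c) for c in completions]
    gen_emb = embedder.encode(gen, convert_to_tensor=True)
    ref_emb = embedder.encode(list(reference), convert_to_tensor=True)
    return [float(util.cos_sim(g, r)) for g, r in zip(gen_emb, ref_emb)]

def answer_reward(completions, answer, **kwargs):
    # 1.0 when the extracted <answer> block matches the ground truth.
    return [
        1.0 if _block("answer", _text(c)).lower() == a.strip().lower() else 0.0
        for c, a in zip(completions, answer)
    ]

def format_reward(completions, **kwargs):
    # Checks that a <reasoning> block is followed by an <answer> block.
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return [1.0 if re.search(pattern, _text(c), re.DOTALL) else 0.0
            for c in completions]

def xml_count_reward(completions, **kwargs):
    # Partial credit for each expected tag, rewarding structural completeness.
    tags = ["<reasoning>", "</reasoning>", "<answer>", "</answer>"]
    return [sum(0.25 for t in tags if t in _text(c)) for c in completions]
```

The scores are left unweighted here; the 0.42 / 0.29 / 0.15 / 0.15 mix is applied by the trainer, for example via `reward_weights` in `GRPOConfig` as in the pipeline sketch above.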
## Results
### Performance Across Datasets
| Dataset | SFT Baseline | SFT + GRPO | Improvement |
|---|---|---|---|
| Base Test | 56.0% | 70.0% | +14.0% |
| BigBio-Med-QA | 52.0% | 56.4% | +4.4% |
| PubMedQA | 47.0% | 56.2% | +9.2% |
### Comparison with Other Approaches
| Approach | Test Accuracy |
|---|---|
| Zero-shot CoT | 35% |
| Few-shot CoT (5 examples) | 42% |
| SFT Baseline | 56% |
| SFT + GRPO (Ours) | 67% |
## Evaluation Methods
- LLM-as-Judge — Gemini 2.0 Flash evaluates logical reasoning and medical correctness
- Perplexity — Measures model confidence and fluency (average reduction of 4.6; a minimal sketch follows this list)
- Human Evaluation — Manual assessment of answer correctness and reasoning clarity
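Perplexity here is the exponential of the mean token-level negative log-likelihood on held-out text. A minimal sketch, assuming `model` and `tokenizer` are already loaded:

```python
# Minimal perplexity sketch: exp(mean negative log-likelihood). Batching and
# sliding-window evaluation are omitted for brevity.
import torch

@torch.no_grad()
def perplexity(text, model, tokenizer, max_length=512):
    enc = tokenizer(text, return_tensors="pt", truncation=True,
                    max_length=max_length).to(model.device)
    # With labels == input_ids the model returns the mean cross-entropy loss.
    loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()
```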
## Tech Stack
- Base Model: Qwen2.5-3B-Instruct (3.09B parameters)
- Fine-tuning: LoRA adapters on a 4-bit quantized base via Unsloth (r=16, targeting q/k/v/o projections); see the loading sketch after this list
- Training Framework: Hugging Face TRL (SFTTrainer → GRPOTrainer pipeline)
- Semantic Similarity: Sentence Transformers (all-MiniLM-L6-v2)
- Hardware: Single NVIDIA T4 or A5000 GPU
- Context Window: 512 tokens
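A loading sketch consistent with the stack above; everything beyond the 4-bit base model, `r=16`, the q/k/v/o target modules, and the 512-token context is an assumption:

```python
# Illustrative setup: Qwen2.5-3B-Instruct loaded in 4-bit with LoRA rank 16
# on the attention projections and a 512-token context.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=512,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,          # assumed; not specified in the write-up
    lora_dropout=0.0,       # assumed
    use_gradient_checkpointing="unsloth",
)
```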
## Why This Matters
- Transparency — Every medical recommendation comes with step-by-step reasoning in structured XML, making outputs interpretable and auditable
- Efficiency — Trains on a single consumer GPU with only 5–10% of parameters updated via LoRA
- Generalization — Consistent improvements across three different medical QA benchmarks, not just the training distribution
- Safety — Chain-of-Thought format forces the model to show its work, making errors easier to catch before they reach patients
## Conclusion
This project demonstrates that GRPO with multi-signal rewards can significantly enhance medical reasoning in small-scale LLMs. The two-stage SFT → GRPO pipeline achieves a 67% test accuracy — nearly double the zero-shot baseline — while remaining trainable on consumer hardware. Future directions include extending to multimodal inputs (medical images, lab reports), implementing adaptive reward weighting, and developing clinical decision-support interfaces.