Feb 10, 2026
5 min read

Medical Chain-of-Thought Reasoning with GRPO

A medical reasoning system that enhances LLMs with Chain-of-Thought reasoning using Group Relative Policy Optimization, achieving a +11% accuracy improvement over supervised fine-tuning baselines and an average gain of +8.9% across three medical QA datasets.

Large Language Models struggle with precision and reasoning in medical contexts, where accuracy can be life-critical. This project develops a 3B-parameter reasoning model for medical applications using Group Relative Policy Optimization (GRPO) combined with Chain-of-Thought (CoT) prompting. The system achieves a +11% accuracy improvement over supervised fine-tuning baselines and an average accuracy gain of +8.9% across three medical QA datasets, all while training efficiently on a single consumer GPU.

Key Achievements

| Metric                              | Value                          |
|-------------------------------------|--------------------------------|
| Accuracy improvement over SFT       | +11%                           |
| Average accuracy gain (3 datasets)  | +8.9%                          |
| Average perplexity reduction        | -4.6                           |
| Hardware requirement                | Single T4/A5000 GPU            |
| Trainable parameters                | 5–10% of base model (via LoRA) |

Architecture

┌─────────────────────────────────────────────────────┐
│                  Qwen2.5-3B Base Model              │
│                   (3B Parameters)                   │
└──────────────────────────┬──────────────────────────┘
                           │
              ┌────────────▼───────────┐
              │   LoRA 4-bit Adapters  │
              │   (5-10% of params)    │
              └────────────┬───────────┘
                           │
                  ┌────────▼────────┐
                  │  SFT Training   │
                  │  (Baseline)     │
                  └────────┬────────┘
                           │
                  ┌────────▼─────────────────────────┐
                  │     GRPO Fine-tuning             │
                  │  ┌──────────────────────────┐    │
                  │  │   Reward Functions:      │    │
                  │  │  • Semantic Similarity   │    │
                  │  │  • Format Compliance     │    │
                  │  │  • Answer Matching       │    │
                  │  │  • XML Structure Count   │    │
                  │  └──────────────────────────┘    │
                  └────────┬─────────────────────────┘
                           │
             ┌─────────────▼─────────────┐
             │  Reasoning + Final Answer │
             │     (XML Format)          │
             └───────────────────────────┘

The two-stage pipeline first establishes a baseline with Supervised Fine-Tuning (SFT) on clinical reasoning chains, then applies GRPO to optimize the model’s reasoning quality through multi-signal reward functions — no separate critic network required.
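A minimal sketch of that two-stage pipeline using Hugging Face TRL is shown below. It is illustrative, not the project's exact training script: the toy datasets, checkpoint names, and hyperparameters are assumptions, the LoRA setup is omitted (see Tech Stack), and only one of the four reward functions is stubbed in.

```python
# Hedged sketch of the SFT -> GRPO pipeline with Hugging Face TRL.
# Dataset contents, checkpoint names, and hyperparameters are illustrative assumptions.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer, GRPOConfig, GRPOTrainer

# Tiny stand-in datasets; the real project trains on medical-o1-reasoning-SFT.
sft_dataset = Dataset.from_list([{
    "text": "Question: ...\n<reasoning>...</reasoning>\n<answer>...</answer>"
}])
grpo_prompts = Dataset.from_list([{"prompt": "Question: ..."}])

# Stage 1: supervised fine-tuning on clinical reasoning chains (baseline).
sft_trainer = SFTTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    train_dataset=sft_dataset,
    args=SFTConfig(output_dir="sft-baseline"),
)
sft_trainer.train()
sft_trainer.save_model("sft-baseline")

# Stage 2: GRPO fine-tuning. Rewards score groups of sampled completions per
# prompt, so no separate critic/value network is required.
def format_reward(completions, **kwargs):
    """Toy stand-in for one of the four reward signals (format compliance)."""
    return [float("<reasoning>" in c and "<answer>" in c) for c in completions]

grpo_trainer = GRPOTrainer(
    model="sft-baseline",                  # continue from the SFT checkpoint
    reward_funcs=[format_reward],          # the project combines four signals
    train_dataset=grpo_prompts,
    args=GRPOConfig(output_dir="grpo-model",
                    num_generations=4,
                    per_device_train_batch_size=4),
)
grpo_trainer.train()
```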

Datasets

The model was trained and evaluated across three medical QA datasets:

| Dataset                  | Size           | Description                                                      |
|--------------------------|----------------|------------------------------------------------------------------|
| medical-o1-reasoning-SFT | 90,120 samples | Clinical questions with long reasoning chains (primary training) |
| BigBio-Med-QA            | Varied         | Wide range of medical topics (evaluation)                        |
| PubMedQA                 | Research-based | Evidence-based biomedical questions (evaluation)                 |
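The snippet below is a hedged illustration of how a record from the primary dataset might be mapped into the <reasoning>/<answer> XML target format. The column names (Question, Complex_CoT, Response), the "en" config, and the prompt wording are assumptions based on the public medical-o1-reasoning-SFT release, not the project's exact preprocessing.

```python
# Illustrative mapping of a clinical QA record into the XML training format.
# Column names and the dataset config are assumptions and may differ.
from datasets import load_dataset

SYSTEM_PROMPT = (
    "Answer the medical question. Think step by step inside <reasoning> tags, "
    "then give the final answer inside <answer> tags."
)

def to_xml_example(row):
    """Build a single training string containing the reasoning chain and final answer."""
    return {
        "text": (
            f"{SYSTEM_PROMPT}\n\n"
            f"Question: {row['Question']}\n"
            f"<reasoning>{row['Complex_CoT']}</reasoning>\n"
            f"<answer>{row['Response']}</answer>"
        )
    }

dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train")
dataset = dataset.map(to_xml_example, remove_columns=dataset.column_names)
print(dataset[0]["text"][:300])
```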

Reward Design

GRPO training uses four complementary reward signals with learned weights (a combined-reward sketch in code follows this list):

  1. Semantic Similarity (42% weight) — Sentence Transformer (all-MiniLM-L6-v2) measures cosine similarity between generated and reference Chain-of-Thought reasoning
  2. Answer Matching (29% weight) — Direct comparison of final answer with ground truth
  3. Format Compliance (15% weight) — Verifies presence of <reasoning> and <answer> XML tags
  4. XML Structure Count (15% weight) — Weighted score for tag balance and structural completeness

This multi-signal approach teaches the model to produce both correct answers and transparent, well-structured reasoning chains — critical for medical applications where interpretability matters.
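As a concrete illustration, here is a minimal sketch of how the four signals could be combined into a single scalar reward with the stated weights. Only the weights and the all-MiniLM-L6-v2 similarity model come from the description above; the helper names, regexes, and exact scoring rules are assumptions.

```python
# Hedged sketch of the combined reward: four signals weighted 0.42/0.29/0.15/0.15.
# Helper names and exact scoring rules are illustrative assumptions.
import re
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(generated_cot: str, reference_cot: str) -> float:
    """Cosine similarity between generated and reference reasoning chains."""
    emb = embedder.encode([generated_cot, reference_cot], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

def answer_match(generated_answer: str, gold_answer: str) -> float:
    """1.0 if the final answer matches the ground truth (case/whitespace-insensitive)."""
    return float(generated_answer.strip().lower() == gold_answer.strip().lower())

def format_compliance(text: str) -> float:
    """1.0 if both <reasoning> and <answer> blocks are present."""
    has_reasoning = re.search(r"<reasoning>.*?</reasoning>", text, re.DOTALL)
    has_answer = re.search(r"<answer>.*?</answer>", text, re.DOTALL)
    return float(bool(has_reasoning and has_answer))

def xml_structure_count(text: str) -> float:
    """Partial credit (0-1) for each expected tag that appears exactly once."""
    tags = ["<reasoning>", "</reasoning>", "<answer>", "</answer>"]
    return sum(0.25 for tag in tags if text.count(tag) == 1)

def combined_reward(completion: str, reference_cot: str, gold_answer: str) -> float:
    """Weighted sum of the four reward signals for one sampled completion."""
    gen_cot = re.search(r"<reasoning>(.*?)</reasoning>", completion, re.DOTALL)
    gen_ans = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return (
        0.42 * semantic_similarity(gen_cot.group(1) if gen_cot else completion, reference_cot)
        + 0.29 * answer_match(gen_ans.group(1) if gen_ans else "", gold_answer)
        + 0.15 * format_compliance(completion)
        + 0.15 * xml_structure_count(completion)
    )
```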

Results

Performance Across Datasets

| Dataset   | SFT Baseline | SFT + GRPO | Improvement |
|-----------|--------------|------------|-------------|
| Base Test | 56.0%        | 70.0%      | +14.0%      |
| BioMedQA  | 52.0%        | 56.4%      | +4.4%       |
| PubMedQA  | 47.0%        | 56.2%      | +9.2%       |

Comparison with Other Approaches

| Approach                  | Test Accuracy |
|---------------------------|---------------|
| Zero-shot CoT             | 35%           |
| Few-shot CoT (5 examples) | 42%           |
| SFT Baseline              | 56%           |
| SFT + GRPO (Ours)         | 67%           |

Evaluation Methods

  • LLM-as-Judge — Gemini 2.0 Flash evaluates logical reasoning and medical correctness
  • Perplexity — Measures model confidence and fluency (-4.6 average reduction); see the sketch after this list
  • Human Evaluation — Manual assessment of answer correctness and reasoning clarity
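The perplexity measurement follows the standard Hugging Face Transformers pattern sketched below: exponentiate the mean token-level cross-entropy of the model on a reference text. The checkpoint path is a placeholder, and the project's exact evaluation script may batch and mask differently.

```python
# Generic perplexity sketch: exp of the mean token-level cross-entropy loss.
# The checkpoint path is a placeholder assumption.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "grpo-model"  # placeholder for the fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of the model on a single reference text."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return math.exp(loss.item())

print(perplexity("<reasoning>The patient presents with ...</reasoning><answer>...</answer>"))
```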

Tech Stack

  • Base Model: Qwen2.5-3B-Instruct (3.09B parameters)
  • Fine-tuning: LoRA 4-bit quantization via Unsloth (r=16, targeting q/k/v/o projections; configuration sketched below)
  • Training Framework: Hugging Face TRL (SFTTrainer → GRPOTrainer pipeline)
  • Semantic Similarity: Sentence Transformers (all-MiniLM-L6-v2)
  • Hardware: Single NVIDIA T4 or A5000 GPU
  • Context Window: 512 tokens
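The LoRA setup listed above might be configured roughly as follows with Unsloth. The rank (r=16), 4-bit loading, 512-token context, and q/k/v/o target modules come from the list above; the exact model identifier, lora_alpha, and dropout values are assumptions.

```python
# Hedged sketch of the 4-bit LoRA setup with Unsloth (r=16, q/k/v/o projections).
# The model identifier and lora_alpha/dropout values are assumptions.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B-Instruct",  # assumed checkpoint name
    max_seq_length=512,                        # matches the 512-token context window
    load_in_4bit=True,                         # 4-bit quantization for a single T4/A5000
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                      # LoRA rank from the tech stack
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,                             # assumed scaling factor
    lora_dropout=0.0,                          # assumed
    use_gradient_checkpointing="unsloth",      # memory savings on consumer GPUs
)
```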

Why This Matters

  • Transparency — Every medical recommendation comes with step-by-step reasoning in structured XML, making outputs interpretable and auditable
  • Efficiency — Trains on a single consumer GPU with only 5–10% of parameters updated via LoRA
  • Generalization — Consistent improvements across three different medical QA benchmarks, not just the training distribution
  • Safety — Chain-of-Thought format forces the model to show its work, making errors easier to catch before they reach patients

Conclusion

This project demonstrates that GRPO with multi-signal rewards can significantly enhance medical reasoning in small-scale LLMs. The two-stage SFT → GRPO pipeline achieves a 67% test accuracy — nearly double the zero-shot baseline — while remaining trainable on consumer hardware. Future directions include extending to multimodal inputs (medical images, lab reports), implementing adaptive reward weighting, and developing clinical decision-support interfaces.