GRPO
Verifier-grounded RL with group-relative advantages
GRPO samples multiple completions per prompt, scores them with a verifier, and updates the policy using group-relative advantages.
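A minimal sketch of the group-relative step, to make the idea concrete. It assumes the common formulation in which each completion's reward is centered on its group's mean and scaled by the group's standard deviation. The function name, the eps constant, and the choice of population standard deviation are illustrative assumptions, not a description of halo-forge internals.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize one prompt's rewards against its own group.

    rewards: verifier scores for the completions sampled from a single prompt.
    eps: small constant (assumed value) that avoids division by zero when
         every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled completions, binary verifier rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every completion scores the same yields all-zero advantages,
# so that prompt contributes no learning signal in this formulation.
flat = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Completions that beat their group's average get a positive advantage and the rest get a negative one, so the update depends only on how completions compare within the same prompt.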
Use GRPO when completions can be scored automatically: by code execution, unit tests, schema validation, exact-answer checks, or an LLM judge.
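As an illustration of an exact-answer check, the sketch below scores a completion 1.0 when its final number matches the reference and 0.0 otherwise. The function names and the "last number in the text" rule are hypothetical. This is not the halo-forge verifier interface, only an example of the kind of reward such a check produces.

```python
import re

# Matches integers and decimals, with an optional leading minus sign.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number found in text as a string, or None."""
    # Drop commas so thousands separators such as "1,000" parse as one number.
    matches = _NUMBER.findall(text.replace(",", ""))
    return matches[-1] if matches else None


def exact_answer_reward(completion, reference):
    """Hypothetical binary reward: 1.0 on a numeric match, else 0.0."""
    predicted = extract_final_number(completion)
    expected = extract_final_number(reference)
    if predicted is None or expected is None:
        return 0.0
    return 1.0 if float(predicted) == float(expected) else 0.0


# GSM8K reference solutions end with a line of the form "#### <answer>".
reward_ok = exact_answer_reward("6 * 7 = 42, so the answer is 42.", "#### 42")
reward_bad = exact_answer_reward("I think the answer is 41.", "#### 42")
```

A binary reward like this is easy to audit, but it gives no partial credit, so the learning signal depends on at least some completions in each group getting the answer right.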
Dashboard
Open Train, choose Reasoning, Code, Tool use, or Preferences, then choose GRPO. Pick the verifier before launch. Preflight checks the model, dataset, output path, and backend state.
CLI
halo-forge grpo train \
  --dataset gsm8k \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --verifier json_schema \
  --num-generations 4 \
  --output ~/.halo-forge/runs/grpo-reasoning
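Start with a small group size, then increase --num-generations once verifier yield is healthy, that is, once a reasonable share of sampled completions pass the verifier. Larger groups cost more generation time per prompt, so they pay off only when the verifier can tell completions apart.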
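One way to put a number on verifier yield is sketched below. It assumes yield means the fraction of sampled completions that pass, and it also reports the fraction of groups with mixed outcomes, since a group whose completions all receive the same reward has no group-relative signal. The function name, the pass_threshold parameter, and the input layout (one list of rewards per prompt) are assumptions for illustration, not halo-forge output.

```python
def verifier_yield(groups, pass_threshold=1.0):
    """Summarize verifier rewards collected during sampling.

    groups: one list of rewards per prompt (one reward per completion).
    pass_threshold: assumed cutoff at or above which a completion "passes".
    """
    total = sum(len(g) for g in groups)
    passed = sum(1 for g in groups for r in g if r >= pass_threshold)
    # A group is "mixed" when its completions did not all get the same reward.
    mixed = sum(1 for g in groups if len(set(g)) > 1)
    return {
        "pass_rate": passed / total if total else 0.0,
        "mixed_group_rate": mixed / len(groups) if groups else 0.0,
    }


# Three prompts with four completions each: one mixed group,
# one where everything fails, and one where everything passes.
groups = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
]
stats = verifier_yield(groups)
```

A pass rate near 0 or 1 usually goes with a low mixed-group rate. In that case a larger group size mostly adds generation cost, and adjusting the dataset difficulty or the verifier is likely the better first step.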