Trainers
Halo-forge ships a broad post-training surface. The dashboard exposes these as goal-first methods in Train, while the CLI keeps the exact command surface for automation. All trainers share a common config / dispatch / output shape, so the public API and frontend treat every run the same way regardless of which algorithm produced it.
| Algorithm | What it does | When to use | CLI |
|---|---|---|---|
| SFT | Supervised finetuning on instruction/response pairs. | Adapting a base to a domain or task; first step in every recipe. | halo-forge sft train |
| DPO | Preference optimization from (prompt, chosen, rejected) triples. | Alignment without RL. The default published format on HF. | halo-forge dpo train |
| ORPO | Reference-free preference tuning from chosen/rejected pairs. | Lower-memory preference pass when DPO’s reference model is too expensive. | halo-forge orpo train |
| RM | Bradley-Terry reward model from preference pairs. | Build a reusable scorer for ranking or later RL. | halo-forge rm train |
| GRPO | Verifier-grounded policy gradient with group-relative advantages. | RLVR — code execution, math, tool calling, anything with a programmatic reward. | halo-forge grpo train |
| RAFT | Rejection-sampling RL: sample N, verify, SFT on the kept ones. | The simplest RLVR; works without KL terms or ratio clipping. | halo-forge raft train |
| VLM | Vision-language training. | Image Q&A, document extraction, screenshots, charts. | halo-forge vlm train |
| Audio | Audio/speech training. | ASR, classification, and audio-language tasks. | halo-forge audio train |
| Reasoning | Reasoning-specific training loop. | Math and multi-step answers. | halo-forge reasoning train |
| Agentic | Tool-use/function-calling training. | Structured outputs and tool traces. | halo-forge agentic train |
The post-training triad is SFT -> DPO/ORPO -> GRPO (plus optional RAFT and RM). Run them in order; each builds on the previous artifact.
Backend dispatch
Every trainer routes through halo_forge.<modality>._dispatch.get_<modality>_trainer(config). The dispatcher:
- Reads the active backend (halo_forge.backend.get_backend()), or honors --accelerator if set.
- Picks the right trainer class for that backend.
- Returns the constructed trainer; the trainer’s __init__ validates the config against what that backend can honor (see halo_forge.utils.backend_config.warn_unsupported_for_mlx).
So setting --use-dora on an MLX host doesn’t silently fall back to vanilla LoRA — the trainer prints a warning at init pointing at the limitation.
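The dispatch flow described above can be sketched roughly as follows. This is an illustrative stand-in, not the halo_forge source: `SFTConfig`, `TorchSFTTrainer`, `MLXSFTTrainer`, `TRAINERS`, and `detect_backend` are hypothetical names, and the real dispatcher may resolve backends differently.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class SFTConfig:  # hypothetical config, not the halo_forge class
    use_dora: bool = False
    accelerator: Optional[str] = None  # mirrors an explicit --accelerator


class TorchSFTTrainer:  # hypothetical
    def __init__(self, config: SFTConfig) -> None:
        self.config = config


class MLXSFTTrainer:  # hypothetical
    def __init__(self, config: SFTConfig) -> None:
        # Validate at construction time so unsupported options are surfaced
        # instead of being dropped silently.
        if config.use_dora:
            warnings.warn("use_dora is not supported on the MLX backend")
        self.config = config


TRAINERS = {"torch": TorchSFTTrainer, "mlx": MLXSFTTrainer}


def detect_backend() -> str:
    """Stand-in for backend auto-detection; always reports 'torch' here."""
    return "torch"


def get_sft_trainer(config: SFTConfig):
    # An explicit accelerator wins; otherwise fall back to detection.
    backend = config.accelerator or detect_backend()
    return TRAINERS[backend](config)
```

Keeping validation inside each trainer's `__init__` means the dispatcher stays a plain lookup, and every backend owns the list of options it cannot honor.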
SFT
halo-forge sft train \
--dataset codealpaca \
--model Qwen/Qwen2.5-Coder-3B \
--epochs 3 --batch-size 2 --gradient-accumulation 16 \
--lora-rank 16 --lora-alpha 32 --use-dora \
--learning-rate 2e-4 --optim adamw_torch
LoRA / PEFT options. All four advanced PEFT methods are available on PyTorch backends:
- --use-dora: DoRA decomposition (magnitude + direction)
- --use-rslora: rank-stabilized LoRA scaling (alpha/√r)
- --init-lora-weights pissa: PiSSA initialization (faster convergence)
- --init-lora-weights loftq|olora|gaussian|true|false: other init strategies
MLX SFT supports vanilla LoRA only; the other PEFT flags warn at init.
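To make the rsLoRA option concrete, the snippet below compares the two scaling rules for the rank and alpha used in the SFT command. It is a small arithmetic sketch; `lora_scale` is a hypothetical helper, not a halo-forge function.

```python
import math


def lora_scale(alpha: float, rank: int, use_rslora: bool = False) -> float:
    """Multiplier applied to the low-rank update.

    Standard LoRA divides alpha by the rank; rank-stabilized LoRA divides
    by the square root of the rank instead.
    """
    return alpha / math.sqrt(rank) if use_rslora else alpha / rank


print(lora_scale(32, 16))                   # 2.0
print(lora_scale(32, 16, use_rslora=True))  # 8.0
```

With rank 16 the rank-stabilized multiplier is four times larger, so a learning rate tuned for one scaling rule may need revisiting when switching to the other.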
Optimizer choice. --optim is forwarded to transformers.TrainingArguments.optim:
- adamw_torch (default): works everywhere
- adamw_bnb_8bit: bitsandbytes 8-bit AdamW; halves optimizer memory. CUDA/ROCm-with-bnb only.
- lion_8bit, paged_adamw_8bit, paged_adamw_32bit: bitsandbytes variants
Datasets. Built-in short names (run halo-forge sft datasets): codealpaca, metamath, gsm8k_sft, llava, xlam_sft, glaive_sft, …. Or pass any HuggingFace dataset id, or --data path/to/file.jsonl for a local file.
DPO
halo-forge dpo train \
--dataset ultrafeedback \
--model Qwen/Qwen2.5-3B-Instruct \
--beta 0.1 --loss-type sigmoid \
--learning-rate 5e-6 --epochs 1
Algorithm knobs.
- --beta: KL-regularization strength against the reference model (0.1 canonical; 0.05 gentler; 0.5 sharper)
- --loss-type: sigmoid (canonical), ipo, hinge, kto_pair, rpo
- --label-smoothing: cDPO smoothing (0.0 disables)
- --reference-free: skip loading the reference model; use a frozen policy at step 0 instead. Saves memory and is still the lowest-risk MLX mode.
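As a rough illustration of how --beta and --label-smoothing enter the sigmoid loss, here is a per-example sketch in plain Python. It follows the standard DPO formulation with cDPO smoothing and takes summed sequence log-probabilities as inputs; the function names are hypothetical, and halo-forge's own implementation may differ in detail.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_sigmoid_loss(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,
    label_smoothing: float = 0.0,
) -> float:
    """Sigmoid DPO loss for one (prompt, chosen, rejected) triple."""
    # How much more the policy prefers chosen over rejected than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    logits = beta * margin
    # label_smoothing > 0 mixes in the loss for the flipped preference label.
    return (
        -(1.0 - label_smoothing) * log_sigmoid(logits)
        - label_smoothing * log_sigmoid(-logits)
    )
```

When the policy and reference agree exactly, the margin is zero and the loss is log 2; it falls as the policy widens its preference for the chosen response relative to the reference.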
Datasets. Built-in short names: ultrafeedback, orca_dpo, hh_rlhf, py_dpo. JSONL files need prompt, chosen, rejected columns; UltraFeedback’s chat-list format is auto-collapsed.
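Before launching a run on a local file, it can help to confirm that every record carries the three required columns. The checker below is a minimal sketch; `check_preference_jsonl` is a hypothetical helper and not part of the halo-forge CLI.

```python
import json

REQUIRED = ("prompt", "chosen", "rejected")


def check_preference_jsonl(lines):
    """Return (line_number, missing_keys) for each record lacking a required column."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [key for key in REQUIRED if key not in record]
        if missing:
            problems.append((number, missing))
    return problems


sample = [
    '{"prompt": "2+2?", "chosen": "4", "rejected": "5"}',
    '{"prompt": "Capital of France?", "chosen": "Paris"}',
]
print(check_preference_jsonl(sample))  # [(2, ['rejected'])]
```

For a file on disk, pass an open file handle (`with open(path) as handle: check_preference_jsonl(handle)`), since iterating a text file yields one line at a time.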
MLX DPO scope. MLX supports sigmoid, ipo, hinge, and kto_pair in reference-free and reference-model modes. rpo remains on the PyTorch/TRL path.
GRPO
halo-forge grpo train \
--data prompts.jsonl \
--model Qwen/Qwen2.5-Coder-3B \
--verifier execution --num-generations 8 \
--beta 0.04 --epsilon 0.2 \
--rollout-engine vllm # or 'mlx' on Apple Silicon
Algorithm knobs.
- --num-generations: group size; how many completions per prompt. 4-8 is typical.
- --beta: KL strength (0.04 = DeepSeek-R1 default; 0 disables KL).
- --epsilon: PPO ratio clip (0.2 standard).
- --temperature: rollout sampling temperature (0.9 default; higher = more diverse groups).
- --no-scale-rewards: flip from canonical GRPO (advantage / std) to RLOO-flavored (mean baseline only).
- --reference-free: skip the reference model; saves memory. MLX supports both reference-free and reference-model GRPO.
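The group-relative advantage and the effect of --no-scale-rewards can be sketched in a few lines. This is an illustration under assumptions: `group_advantages` is a hypothetical helper, it uses the population standard deviation, and the `eps` guard is an arbitrary choice; real implementations may pick a different estimator or epsilon.

```python
from statistics import mean, pstdev


def group_advantages(rewards, scale_rewards=True, eps=1e-4):
    """Advantages for one prompt's group of sampled completions."""
    baseline = mean(rewards)
    centered = [reward - baseline for reward in rewards]
    if not scale_rewards:
        # Mean baseline only, as with --no-scale-rewards.
        return centered
    # Canonical GRPO: also divide by the group's reward spread.
    spread = pstdev(rewards)
    return [value / (spread + eps) for value in centered]


rewards = [1.0, 0.0, 0.0, 1.0]  # e.g. pass/fail verifier scores
print(group_advantages(rewards, scale_rewards=False))  # [0.5, -0.5, -0.5, 0.5]
print(group_advantages(rewards))
```

A group whose completions all receive the same reward yields zero advantage everywhere, so it contributes no policy-gradient signal; this is one reason higher rollout temperature and larger groups can help.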
Verifier integration. --verifier <short_name> resolves through the V1 plugin registry. Pass execution, pytest, humaneval, llm_judge, json_schema, or any @register_verifier-decorated class. The trainer instantiates and bridges to TRL’s reward_funcs (PyTorch) or the manual scoring loop (MLX).
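On the PyTorch path, the bridge ends in a reward callable of the kind TRL's reward_funcs accepts: it receives the prompts and completions and returns one float per completion. The sketch below shows only that callable side, assuming plain-string completions; it does not reproduce halo-forge's verifier class interface or the @register_verifier signature, which are not shown here, and `json_validity_reward` is a hypothetical name.

```python
import json


def json_validity_reward(prompts, completions, **kwargs):
    """Score 1.0 for completions that parse as a JSON object, else 0.0."""
    rewards = []
    for completion in completions:
        try:
            parsed = json.loads(completion)
        except (TypeError, ValueError):
            # Not a string, or not valid JSON.
            rewards.append(0.0)
            continue
        rewards.append(1.0 if isinstance(parsed, dict) else 0.0)
    return rewards


print(json_validity_reward(
    prompts=["Return an object."] * 3,
    completions=['{"a": 1}', "not json", "[1, 2]"],
))  # [1.0, 0.0, 0.0]
```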
Rollout engine (--rollout-engine):
- auto (default): torch fallback (HF generate)
- torch: same as auto; explicit
- vllm: continuous-batched on CUDA/ROCm. 5-10× faster generation. gfx1151 is experimental.
- mlx: Apple Silicon native via mlx_lm.generate
RAFT
The original RLVR algorithm in halo-forge. Sample N, verify, SFT on the kept samples — no KL term, no ratio clipping, just iterated rejection sampling.
halo-forge raft train \
--prompts data/prompts.jsonl \
--model Qwen/Qwen2.5-Coder-3B \
--verifier execution \
--cycles 5 --samples-per-prompt 8 --keep-percent 0.3 \
--rollout-engine vllm
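The filtering step of one RAFT cycle can be sketched as below. This is an assumption-laden illustration: `raft_select` is a hypothetical helper, and whether halo-forge ranks per prompt or globally, and how it rounds the kept count, is not specified here.

```python
import math


def raft_select(samples, keep_percent=0.3):
    """Keep the top fraction of (completion, score) pairs by verifier score."""
    # Rounding up and keeping at least one sample are choices made for this sketch.
    keep = max(1, math.ceil(len(samples) * keep_percent))
    ranked = sorted(samples, key=lambda item: item[1], reverse=True)
    return ranked[:keep]


# Eight scored completions for one prompt, as with --samples-per-prompt 8.
samples = [(f"completion-{i}", i / 10) for i in range(8)]
kept = raft_select(samples, keep_percent=0.3)
print([name for name, _ in kept])
```

The kept completions then become ordinary (prompt, completion) pairs for an SFT pass, and the next cycle samples from the updated policy.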
MLX RAFT. Two paths:
- --accelerator mlx --rollout-only: hybrid, MLX rollouts + PyTorch policy update.
- --accelerator mlx (no --rollout-only): full MLX-native RAFT (Phase 5b).
Comparing trainers
| | SFT | DPO | GRPO | RAFT |
|---|---|---|---|---|
| Data shape | (prompt, completion) | (prompt, chosen, rejected) | prompt only | prompt only |
| Reward source | label likelihood | preference labels | verifiers | verifiers |
| Wall-clock per step | low | medium (2× forward) | high (N forwards + verify) | high (N forwards + verify) |
| Sample efficiency | high (every token contributes) | medium | high (advantages reweight all samples) | low (drops 70%+ via threshold) |
| KL term | no | yes | yes | no |
| MLX native | ✅ | ✅ reference-free/reference-model | ✅ reference-free/reference-model | ✅ |
Recommended pipeline for a new task.
- SFT on a domain dataset to align format + style. 1-3 epochs, LR 2e-4.
- DPO on a preference-pair dataset to align quality. 1 epoch, LR 5e-6, β=0.1.
- GRPO with a programmatic verifier to optimize the actual objective. 1 cycle, LR 1e-6, β=0.04, num_generations=8.
Output shape
Every trainer writes a training_summary.json to --output with the same schema:
{
"modality": "dpo",
"model_name": "Qwen/Qwen2.5-3B-Instruct",
"run_id": "dpo-1730000000000",
"seed": 42,
"cycles": [{"cycle": 0, "train_loss": ..., "cycle_duration_seconds": ..., ...}],
"total_train_steps_executed": 200,
"final_train_loss": 0.42,
"weights_updated": true,
"effectiveness": {"verdict": "passed", "reasons": [...]},
"yield_diagnostics": {...},
"recovery": {...},
"extra": {"beta": 0.1, "loss_type": "sigmoid", "reward_accuracy_history": [...]}
}
The same shape powers the run database (/runs/search), the run-detail UI, and the deterministic replay manifest. See docs/REPLAY.md for the replay contract.
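Because every trainer writes the same schema, downstream tooling can gate on a run without knowing which algorithm produced it. The sketch below reads the summary and checks two of the fields shown above; `load_summary` and `run_passed` are hypothetical helpers, and treating anything other than "passed" as a failure is an assumption, since the full set of verdict values is not listed here.

```python
import json
from pathlib import Path


def load_summary(output_dir):
    """Load training_summary.json from a run's output directory."""
    return json.loads((Path(output_dir) / "training_summary.json").read_text())


def run_passed(summary):
    """True when weights were updated and the effectiveness verdict is 'passed'."""
    verdict = summary.get("effectiveness", {}).get("verdict")
    return bool(summary.get("weights_updated")) and verdict == "passed"
```

A pipeline script could call `run_passed(load_summary(output_dir))` after each stage and stop before the next trainer if the check fails.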