Choose a Training Method
Pick the right Halo Forge trainer from the dashboard or CLI
Halo Forge now exposes the same training surface in the dashboard and CLI. Use Start for the first safe SFT run, then use Train when you know the goal and need a specific method.
| Goal | Start with | Move to | Data shape |
|---|---|---|---|
| Domain/style adaptation | SFT | DPO or ORPO | prompt/completion |
| Code with executable checks | SFT | RAFT or GRPO | prompts plus verifier |
| Preference alignment | DPO or ORPO | RM, then GRPO | prompt/chosen/rejected |
| Verifier-grounded reasoning | SFT or Reasoning | GRPO | prompts plus verifier |
| Vision-language | VLM | VLM with verifier gating | image + prompt + answer |
| Audio | Audio | Audio with task verifier | audio + transcript/label |
| Tool use | Agentic | GRPO with schema/tool verifier | messages/tool calls |
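The two text data shapes in the table can be made concrete with a small sketch. The key names below (`prompt`, `completion`, `chosen`, `rejected`) mirror the table's wording; whether Halo Forge expects exactly these keys is an assumption, and `detect_shape` is a hypothetical helper for illustration, not part of Halo Forge.

```python
from typing import Optional

# Hypothetical record shapes: key names follow the table, not a confirmed Halo Forge schema.
SHAPES = {
    "prompt/completion": {"prompt", "completion"},
    "prompt/chosen/rejected": {"prompt", "chosen", "rejected"},
}


def detect_shape(record: dict) -> Optional[str]:
    """Return the name of the shape whose required keys are all present, else None."""
    keys = set(record)
    # Check shapes with more required keys first so the most specific match wins.
    for name, required in sorted(SHAPES.items(), key=lambda kv: -len(kv[1])):
        if required <= keys:
            return name
    return None


# Example record for SFT (prompt/completion).
sft_record = {"prompt": "Reverse a string in Python.", "completion": "s[::-1]"}

# Example record for DPO, ORPO, or RM (prompt/chosen/rejected).
pref_record = {
    "prompt": "Explain recursion briefly.",
    "chosen": "A function that calls itself on a smaller input until a base case.",
    "rejected": "Recursion is recursion.",
}

print(detect_shape(sft_record))
print(detect_shape(pref_record))
```

A check like this can be run over a dataset before launch to confirm that every record matches the shape the chosen method needs.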
Dashboard Flow
- Open Train.
- Choose a goal: Code, Reasoning, Tool use, Vision, Audio, or Preferences.
- Choose the method Halo Forge should run.
- Review the generated launch, preflight, output path, and backend notes.
- Launch, then monitor the run from Runs and inspect artifacts in Results.
The dashboard intentionally hides rarely used flags until you open the advanced drawer. The CLI remains available for exact reproducibility and scripting.
Method Guide
- SFT learns from labeled examples. Use it first unless you already have preference data.
- RAFT generates multiple answers per prompt, scores them with a verifier, keeps the ones that pass, then fine-tunes on the kept answers.
- DPO uses chosen/rejected pairs to improve behavior without an explicit reward model.
- ORPO uses the same pair data as DPO but skips the reference model.
- RM trains a reward scorer from chosen/rejected pairs.
- GRPO samples a group of answers per prompt, uses the verifier's score as the reward, and updates the policy based on each answer's reward relative to its group.
- VLM, Audio, Reasoning, and Agentic are domain-specific training surfaces, with capability gates where needed.
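The core idea behind three of the methods above can be sketched in a few lines of plain Python. This is a minimal illustration of the general techniques, not Halo Forge's implementation: the function names, the `threshold` and `beta` defaults, and the toy verifier are all assumptions, and the GRPO sketch uses the population standard deviation, a detail on which implementations differ.

```python
import math
from statistics import mean, pstdev
from typing import Callable, List, Sequence


def raft_filter(candidates: Sequence[str],
                verifier: Callable[[str], float],
                threshold: float = 0.5) -> List[str]:
    """RAFT-style filtering: keep candidates whose verifier score meets the threshold."""
    return [c for c in candidates if verifier(c) >= threshold]


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * (policy margin - reference margin)))."""
    margin = ((policy_chosen_logp - policy_rejected_logp)
              - (ref_chosen_logp - ref_rejected_logp))
    x = beta * margin
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch on sign to avoid overflow.
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def grpo_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Group-relative advantages: each reward standardized against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy verifier (hypothetical): full reward only when the answer is exactly "4".
def toy_verifier(answer: str) -> float:
    return 1.0 if answer.strip() == "4" else 0.0


answers = ["4", "5", " 4 ", "four"]
kept = raft_filter(answers, toy_verifier)          # answers RAFT would train on
rewards = [toy_verifier(a) for a in answers]       # per-answer rewards for GRPO
advantages = grpo_advantages(rewards)              # positive for passing answers
pair_loss = dpo_loss(-1.0, -2.0, -1.5, -1.5)       # policy already prefers "chosen"
```

The sketch shows why the data requirements differ: RAFT and GRPO only need prompts plus a verifier, because the answers are generated during training, while DPO needs chosen/rejected pairs and log-probabilities from a reference model. ORPO is omitted because it changes the loss rather than the data.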