Documentation

Cross-vendor local finetuning workstation — SFT, DPO, GRPO, RAFT, RM with verifier-grounded rewards on ROCm, CUDA, Apple MLX, Apple MPS.

What halo-forge is

A workstation tool that takes a base model and turns it into a finetuned, evaluated, served artifact — without leaving the local machine.

The single thing that makes it different from every adjacent project (axolotl, llama-factory, unsloth, mlx-lm-lora, torchtune): it runs natively on every modern accelerator, not just CUDA.

Pick a goal. Choose a catalog model. Pick an algorithm and verifier. Train, evaluate, serve, and save the run into a bundle when it is worth comparing.

Start By Intent

I want to…	Start here
Train my first local model	Quick Start
Control a training workstation remotely	Public Frontend: remote workstation
Pick the right base model	Choose a Model
See runnable examples	Usage Scenarios
Run on Apple Silicon	Hardware Notes and Apple Silicon MLX Quickstart
Serve or export a trained artifact	Serve / convert / merge

Capabilities

Trainers

SFT — supervised finetuning with QLoRA / LoRA / DoRA / rsLoRA / PiSSA. PyTorch on every torch backend; MLX-native on Apple Silicon.
DPO — preference optimization (sigmoid / IPO / hinge / KTO-pair / RPO / cDPO). PyTorch via TRL; MLX-native DPO supports sigmoid, IPO, hinge, and KTO-pair in reference-free and reference-model modes.
GRPO — verifier-grounded policy gradient (DeepSeek-R1 / Tülu 3 family). PyTorch via TRL; MLX-native reference-free and reference-model GRPO.
RAFT — rejection-sampling RLVR with curriculum + reward shaping. PyTorch + native MLX.
Reward Model — Bradley-Terry RM from preference pairs. Becomes a learned verifier for any other modality.

Verifiers

Pluggable registry — drop a .py in ~/.halo-forge/verifiers/ or use @register_verifier. Out of the box:

Execution & compile: gcc, clang, mingw, execution, pytest, humaneval, mbpp, rust, cargo, go, custom, subprocess
Schema & format: json_structure, json_schema, regex_format
Reference metrics: bleu, rouge, chrf
LLM-as-judge: llm_judge — rubric-graded with any local or hosted judge model

Data pipeline

Synthesize — generate completions from seed prompts via a teacher model + verifier filter.
Dedup — exact (SHA-256) + fuzzy (MinHash + LSH).
Score — heuristic quality scoring + threshold / top-K filter.
Compose — synthesize → dedup → score → filter is the four-command pre-finetune sequence.

Inference + serving

OpenAI-compatible serving — halo-forge serve --model X exposes /v1/chat/completions, /v1/completions, /v1/models.
Unified convert — halo-forge convert --format mlx|gguf|hf --quant q4|q8|fp16|bf16|fp32
Round-trip verify — halo-forge convert --verify catches silently-broken exports.
vLLM rollout — continuous-batched generation on CUDA/ROCm.
MLX rollout — Apple Silicon equivalent via mlx_lm.generate.

Evaluation

lm-evaluation-harness — halo-forge eval --tasks core runs MMLU / GSM8K / HumanEval / IFEval / ARC etc.
Mid-training probe — halo-forge probe runs a small held-out benchmark and diffs against a baseline; catches catastrophic forgetting in single-digit minutes.

Reproducibility

Replay manifests — halo-forge replay <run_dir> regenerates the exact launch command.
Sweep infrastructure — Optuna-style hyperparameter search with random / TPE / grid samplers.

Run management

SQLite run database — search / filter / sort / paginate runs.
Multi-run comparison — pin runs, overlay loss + reward curves, side-by-side config diff.
Cohort eval dashboard — runs × tasks grid; best-per-task highlighted.
Cost rollup — per-run kWh + $ from wall-clock × backend nominal power.
Live telemetry strip — SSE-streamed GPU util / VRAM / power / throughput.
Remote workstation — non-loopback access uses bearer tokens and controls one Halo Forge host.

Adapter merging

Bake — single LoRA into base, output is a standard HF checkpoint.
Combine — N adapters via linear / ties / dare_linear / dare_ties / magnitude_prune.

Auth + multi-user

API tokens — bearer-token auth, automatic when bound to non-loopback. Local-first stays zero-config.

Getting started

Quick Start — Install + first run
Choose a Model — Model catalog, Liquid AI caveats, and first picks
Usage Scenarios — Code, preference, reasoning, VLM, audio, agentic, serve/export
Hardware Notes — Per-backend recommendations + feature matrix
Remote Workstation — Token-authenticated browser access to one training host

Trainers

Overview — Choosing between SFT / DPO / GRPO / RAFT / RM

Verifiers

Reproducibility

Reference

Background

Theory & Research — RLVR foundations
Graduated Rewards — Partial credit
Learning Rate Strategies — LR per algorithm

Choose a Training Method

Pick the right Halo Forge trainer from the dashboard or CLI

Command Index

Complete index of all halo-forge commands and flags

Full Pipeline

Complete guide to training a code generation model

Quick Start

Three practical paths from install to first useful Halo Forge run

Choose a Model

How to pick a base model for SFT, RAFT, DPO, GRPO, VLM, audio, and serving

Data Generation

Preparing training data for SFT and RAFT

SFT Training

Supervised fine-tuning to establish baseline capability

Toolbox Setup

Build and configure the halo forge container environment

Graduated Rewards

Why partial credit matters for RLVR training

RAFT Training

Reward-Ranked Fine-Tuning with compiler verification

Learning Rate Strategies

Experimental learning rate recommendations for RAFT training

Windows Build Server

Configure a Windows machine for MSVC verification

Benchmarking

Evaluate model performance with pass@k metrics

Web UI

Dashboard for training, benchmarking, and monitoring

Model Catalog Reference

Catalog schema, model family status, and compatibility guidance

Production Training Runs

Step-by-step commands for training all model sizes on the Windows Systems Programming dataset

Public Frontend

User-facing local and remote workstation surface for training, monitoring, results, and docs

Halo-forge ships four post-training algorithms. They share a common config / dispatch / output shape so the public API and frontend treat every run the same way regardless of which algorithm produced it.

Preference Tuning

DPO and ORPO training from chosen/rejected examples

Code Datasets

Reward Models

Train a scorer from chosen/rejected examples

GRPO

Verifier-grounded RL with group-relative advantages

Vision-Language Training

VLM training for image and text tasks

Audio Training

Audio and speech training paths

Reasoning Training

Math and multi-step reasoning training

Data pipeline

Three operations close the gap between "I have prompts" and "I have a training-ready dataset":

Tool-Use And Agentic Training

Function-calling and structured tool-use training

Dataset Formats

Data shapes expected by each training method

Dashboard Training

Use the Halo Forge dashboard as the primary operator surface

Training Artifacts

Files written by Halo Forge training runs

Apple Silicon MLX Quickstart

Evaluation

`halo-forge eval` wraps EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) so a halo-forge-trained model can be benchmarked against the published academic suites with one command and one consistent result shape.

Inference + serving

Halo-forge ships three commands that close the train → ship loop without leaving the local machine:

Replay manifests

`halo-forge replay <run_dir>` regenerates the exact launch command for a captured run, optionally relaunching it. Every shipped trainer writes a `replay.json` manifest next to the `training_summary.json`, capturing every input that influenced the run.

What halo-forge is

Start By Intent

Capabilities

Trainers

Verifiers

Data pipeline

Inference + serving

Evaluation

Reproducibility

Run management

Adapter merging

Auth + multi-user

Quick navigation

Getting started

Trainers

Verifiers

Data pipeline

Evaluation

Inference + serving

Reproducibility

Reference

Background

Meta

Choose a Training Method

Command Index

Configuration

Full Pipeline

Quick Start

Theory & Research

Choose a Model

Data Generation

SFT Training

Toolbox Setup

Troubleshooting

Graduated Rewards

Hardware Notes

RAFT Training

Usage Scenarios

Learning Rate Strategies

Windows Build Server

Benchmarking

Web UI

Model Catalog Reference

Production Training Runs

Public Frontend

Trainers

Preference Tuning

Code Datasets

Reward Models

GRPO

Vision-Language Training

Audio Training

Reasoning Training

Data pipeline

Tool-Use And Agentic Training

Dataset Formats

Dashboard Training

Training Artifacts

Apple Silicon MLX Quickstart

Evaluation

Inference + serving

Replay manifests

Hyperparameter sweeps

Auth + tokens

Modalities & experimental features

Changelog

Contributing

How to Train

Verifiers