Changelog

All notable changes to halo forge

[1.4.0] - 2026-05-14

Major release. Halo-forge becomes a cross-vendor local finetuning workstation with an acceptance-backed Apple Silicon MLX path.

Added

Trainers

  • DPO (halo-forge dpo train) — preference optimization via TRL on PyTorch backends; MLX-native DPO on Apple Silicon supports sigmoid, IPO, hinge, and KTO-pair in reference-free and reference-model modes. RPO remains a PyTorch path.
  • GRPO (halo-forge grpo train) — verifier-grounded policy gradient with group-relative advantages. PyTorch via TRL; MLX-native reference-free and reference-model paths. DeepSeek-R1 / Tülu 3 family.
  • Reward Model (halo-forge rm train) — Bradley-Terry RM from preference pairs. Closes the RLHF loop for non-code modalities.
  • MLX-native DPO + GRPO — DPO supports sigmoid, IPO, hinge, and KTO-pair in reference-free and reference-model modes; GRPO supports reference-free and reference-model eager updates with rollouts via mlx_lm.generate.

PEFT + optimizers

  • DoRA / rsLoRA / PiSSA / LoftQ / OLoRA--use-dora, --use-rslora, --init-lora-weights pissa etc. on every torch backend.
  • bitsandbytes optimizers--optim adamw_bnb_8bit, lion_8bit, paged_adamw_8bit. CUDA / ROCm only.

Verifiers

  • Plugin registry (V1) — @register_verifier decorator + ~/.halo-forge/verifiers/*.py discovery + entry-point packages.
  • LLM-as-judge (V2) — rubric-graded with pluggable judge callable. Default targets a local halo-forge serve endpoint.
  • Schema verifiers (V3) — json_structure, json_schema, regex_format.
  • Reference-metric verifiers (V4) — bleu, rouge, chrf.

Data pipeline

  • Synthesize (halo-forge data synthesize) — teacher → verifier → filter pipeline. SFT or preference data shape.
  • Dedup (halo-forge data dedup) — exact (SHA-256) + fuzzy (MinHash + LSH).
  • Quality scoring (halo-forge data score) — five-component heuristic + threshold / top-K filter.

Inference + serving

  • OpenAI-compatible servinghalo-forge serve --model X exposes /v1/chat/completions, /v1/completions, /v1/models, and a Halo Forge /health status endpoint.
  • Streaming responses — chat and text completions support OpenAI-shaped text/event-stream chunks ending in data: [DONE].
  • Unified converthalo-forge convert --format mlx|gguf|hf --quant q4|q8|fp16|bf16|fp32.
  • Round-trip verifyhalo-forge convert --verify catches silently-broken exports.
  • vLLM rollout--rollout-engine vllm for CUDA / ROCm.
  • MLX rollout--rollout-engine mlx for Apple Silicon parity.
  • Adapter mergehalo-forge merge --mode bake|combine (linear / ties / dare_linear / dare_ties / magnitude_prune).

Evaluation

  • lm-evaluation-harnesshalo-forge eval --tasks core with curated task groups (core / reasoning / code / instruction_following / knowledge).
  • Mid-training probehalo-forge probe --baseline ./baseline.json runs a small held-out benchmark + reports deltas; exits 2 on regression.

Reproducibility + management

  • Replay manifestshalo-forge replay <run_dir> regenerates the launch command. Captures seed bundle, dataset hash, env fingerprint, full config.
  • Hyperparameter sweeps — programmatic library with random / TPE / grid samplers, Uniform / LogUniform / Choice distributions, sweep-level early stopping.
  • SQLite run database/runs/search endpoint with filter / sort / paginate / facet support.
  • Cohort eval dashboard/eval route renders runs × tasks grid; best-per-task highlighted.
  • Cost rollup — per-run kWh + $ from wall-clock × backend nominal power. UI panel + HALOFORGE_COST_PER_KWH env var.
  • API token auth — bearer-token gate that turns on automatically when bound to non-loopback. halo-forge token create / list / revoke.

Frontend

  • Vite + React 19 + Tanstack Router — replaces the retired NiceGUI surface.
  • First Run Experience v2 — guided /start flow with backend detection, MLX readiness, safe model recommendation, preflight, launch, and route-to-run behavior.
  • Model Catalog v2 — curated static catalog with first-run ranking, memory estimates, license/download notes, fit notes, and risk levels.
  • Live run view — cycle-by-cycle loss + reward charts, scrubber, log tail, sample inspector, cancel button.
  • Multi-run comparison — pin up to 6 runs, overlay loss / reward, side-by-side config diff.
  • Run search — DB-backed filter chips for modality / status / model / has-eval.
  • Cohort eval — runs × tasks grid over pinned runs.
  • Energy & spend card — kWh + $ per run.

Removed

  • NiceGUI web UI — retired in favor of public_app/. The ui/services/ and ui/state.py modules survive as the service layer the public_api consumes.
  • halo-forge ui CLI subcommand.
  • nicegui dependency.

Changed

  • Documentation — README rewritten; new feature-area docs (TRAINERS, DATA, EVAL, SERVING, REPLAY); per-backend feature × backend matrix in HARDWARE_NOTES.
  • Backend matrix — every shipped feature is documented with its actual backend support status (✅ / ⚠️ / ❌). Silent-failure paths for MLX-with-PEFT-flags now warn loudly at trainer init.
  • MLX productizationhalo-forge doctor mlx, Terminal smoke validation, and acceptance evidence document the supported Apple Silicon path for SFT, RAFT, DPO, and GRPO.

Known limitations

  • Direct CLI users still choose MLX explicitly with --accelerator mlx; the dashboard recommends MLX only when the readiness probe is executable.
  • mx.compile remains measurement-only. Eager trainer paths are the production default.
  • Speculative decoding and backend-native token streaming are later serving tracks.
  • Halo Forge surfaces chip, memory, and catalog facts but does not auto-tune batch size, LoRA rank, or trainer defaults from chip tier.

[1.3.0] - 2026-01-21

Added

  • Web UI Verifier Integration - Verifier test page now calls real backend verifiers instead of returning hardcoded results
  • Branding - Halo-forge favicon and sidebar logo integrated into web UI
  • Static Asset Serving - UI properly serves static files from ui/static/
  • SFT --no-gradient-checkpointing CLI flag - Control gradient checkpointing from UI and CLI

Fixed

  • SFT Dataset Routing - Local .jsonl files now correctly use --data flag; HuggingFace IDs use --dataset
  • RAFT Verifier Alignment - UI verifier choices now match CLI --verifier options exactly
  • MBPP Verifier - Natural language prompts no longer cause syntax errors during execution

Changed

  • Removed unused RAFT learning rate UI field (CLI uses lr-decay schedule, not initial LR)

[1.2.0] - 2026-01-10

Added

Auto-Logging

  • Automatic log capture - All training and benchmark commands now automatically log to logs/ with timestamped filenames
  • --quiet flag - Suppress terminal output while still writing to log file
  • No more need for manual tee or PYTHONUNBUFFERED=1

New RAFT CLI Flags

  • --samples-per-prompt - Control samples generated per prompt (default: 8)
  • --temperature - Set generation temperature (default: 0.7)
  • --max-new-tokens - Limit generation length (default: 1024)
  • --min-samples - Auto-adjust threshold if too few samples pass filtering

Preset Config Files

  • configs/raft_conservative.yaml - Safe training: 80% keep, slow LR decay, min 200 samples
  • configs/raft_aggressive.yaml - Strict filtering: 30% keep, 16 samples/prompt, 0.8 temp
  • configs/vlm_example.yaml - VLM RAFT with perception/reasoning/output weights
  • configs/audio_example.yaml - Audio RAFT for ASR/TTS
  • configs/reasoning_example.yaml - Math/reasoning RAFT

Module Consistency

  • Added missing flags to all domain modules (VLM, Audio, Reasoning, Agentic)
  • Consistent --samples-per-prompt, --temperature, --keep-percent, --reward-threshold across all train commands

Changed

  • Added humaneval, mbpp, python to verifier choices in CLI
  • Improved base model loading for LoRA checkpoints (reads from adapter_config.json)
  • Fixed code extraction to strip input tokens from generated completions

[1.1.0] - 2026-01-08

Added

Unified SFT Pipeline

  • halo-forge sft train --dataset - Load HuggingFace datasets directly
  • halo-forge sft datasets - List all available SFT datasets
  • Domain-specific SFT commands for all modules:
    • halo-forge vlm sft
    • halo-forge audio sft
    • halo-forge reasoning sft
    • halo-forge agentic sft

SFT Datasets Module

  • New halo_forge/sft/datasets.py with dataset registry
  • Short name support (e.g., codealpaca, metamath, llava)
  • Auto-formatting to ChatML for HuggingFace datasets
  • --max-samples flag to limit dataset size
  • --dry-run for validation

Supported SFT Datasets

DomainDatasetHuggingFace IDSize
Codecodealpacasahil2801/CodeAlpaca-20k20K
Codecode_instructions_122kTokenBender/code_instructions_122k122K
Reasoningmetamathmeta-math/MetaMathQA395K
Reasoninggsm8k_sftgsm8k8.5K
VLMllavaliuhaotian/LLaVA-Instruct-150K150K
Audiolibrispeech_sftlibrispeech_asr100h
Agenticxlam_sftSalesforce/xlam-function-calling-60k60K
Agenticglaive_sftglaiveai/glaive-function-calling-v2113K

Agentic / Tool Calling Training (Phase 6)

  • New halo_forge/agentic/ module for tool calling RLVR training
  • AgenticRAFTTrainer for RAFT training on function calling
  • Hermes format support (Qwen2.5, NousHermes compatible)
  • TensorBoard integration via MetricsTracker

Tool Calling Verifier

  • ToolCallingVerifier with graduated reward structure
  • JSON validation and schema compliance
  • Function name and argument matching
  • Irrelevance detection (penalizes false positives)
  • Support for parallel and multi-turn tool calls

Tool Calling Dataset Loaders

  • XLAMLoader - 60k verified samples, 3,673 APIs
  • GlaiveLoader - 113k samples with irrelevance detection
  • HermesFormatter for converting to standard format

CLI Commands

  • halo-forge agentic train - Train tool calling with RAFT
  • halo-forge agentic benchmark - Benchmark on tool calling
  • halo-forge agentic datasets - List available datasets

Improved

  • Consistent SFT → RAFT → Benchmark pipeline for ALL modules
  • Consistent CLI banner and colors across all modules
  • MetricsTracker integration for TensorBoard logging
  • 32 new unit tests for agentic module

[1.0.0] - 2026-01-08

Added

Audio Training (Phase 4)

  • New halo_forge/audio/ module for audio-language RLVR training
  • AudioRAFTTrainer for RAFT training on audio models
  • Multi-task verification: ASR, TTS, Audio Classification

Audio Verifiers

  • AudioVerifier base class inheriting from core Verifier
  • ASRChecker for speech-to-text with WER/CER metrics
  • TTSChecker for text-to-speech quality (UTMOS-based)
  • AudioClassificationChecker for sound event detection

Audio Model Adapters

  • WhisperAdapter for OpenAI Whisper models
  • Wav2VecAdapter for wav2vec2 models
  • Automatic dtype handling and attention mask generation

Audio Dataset Loaders

  • LibriSpeechLoader - 960h clean audiobook speech
  • CommonVoiceLoader - Multilingual crowdsourced audio
  • AudioSetLoader - 5M clips for classification
  • SpeechCommandsLoader - Keyword spotting dataset

Math/Reasoning Training (Phase 5)

  • New halo_forge/reasoning/ module for mathematical reasoning
  • ReasoningRAFTTrainer for reasoning task training
  • SymPy-based answer verification

Reasoning Verifiers

  • ReasoningVerifier base class inheriting from core Verifier
  • MathVerifier with numeric and symbolic comparison
  • AnswerExtractor for parsing answers from completions
  • Support for \boxed{}, “The answer is”, and numeric formats
  • Partial credit for showing reasoning steps

Reasoning Dataset Loaders

  • GSM8KLoader - 8.5K grade school math problems
  • MATHLoader - 12.5K competition math problems
  • Support for difficulty levels and subject filtering

CLI Commands

  • halo-forge audio train - Train audio models with RAFT
  • halo-forge audio benchmark - Benchmark on audio datasets
  • halo-forge audio datasets - List audio datasets
  • halo-forge reasoning train - Train on math datasets
  • halo-forge reasoning benchmark - Math benchmarking
  • halo-forge reasoning datasets - List reasoning datasets

Architecture Improvements

  • All verifiers now inherit from base Verifier class
  • Consistent verify() -> VerifyResult interface across domains
  • Unified VerifyResult dataclass with success, reward, error

Changed

  • Updated all containers to v1.0.0
  • Removed torchcodec dependency (using torchaudio/librosa directly)
  • Improved audio loading with graceful fallback to librosa
  • Consistent CLI banner and colors across all commands

Fixed

  • CLI subcommand dispatch issue causing empty output
  • Build script argument parsing for --tag option
  • Whisper dtype mismatch causing float/half errors
  • VLM preprocessor returning 4D tensors instead of 3D

[0.5.0] - 2026-01-07

Added

Vision-Language Model Training (Phase 3)

  • New halo_forge/vlm/ module for VLM RLVR training
  • VLMRAFTTrainer for RAFT training on VLMs
  • Multi-stage verification pipeline for VLM outputs

VLM Verifiers

  • VisionVerifier combining perception, reasoning, and output verification
  • PerceptionChecker with YOLOv8 object detection and EasyOCR
  • ReasoningChecker for chain-of-thought validation
  • OutputChecker for answer matching (exact, fuzzy, semantic)
  • Specialized verifiers: VQAVerifier, DocVQAVerifier, ChartQAVerifier

VLM Model Adapters

  • QwenVLAdapter for Qwen-VL and Qwen2-VL models
  • LLaVAAdapter for LLaVA model family
  • GenericVLMAdapter for other HuggingFace VLMs
  • Auto-detection of appropriate adapter from model name

VLM Dataset Loaders

  • TextVQALoader - Text reading in natural images
  • DocVQALoader - Document understanding
  • ChartQALoader - Chart interpretation
  • RealWorldQALoader - Real-world reasoning
  • MathVistaLoader - Mathematical reasoning with visuals
  • Export to RLVR and SFT formats

Image Processing

  • VLMPreprocessor for generic image preprocessing
  • QwenVLProcessor for Qwen-VL models
  • LLaVAProcessor for LLaVA models

CLI Commands

  • halo-forge vlm train - Train VLM with RAFT
  • halo-forge vlm benchmark - Benchmark VLM on datasets
  • halo-forge vlm datasets - List available VLM datasets

Changed

  • Updated changelog with Phase 3 features
  • Added VLM documentation pages to website

[0.4.0] - 2026-01-06

Added

Inference Optimization Mode

  • New halo_forge/inference/ module for model optimization
  • InferenceOptimizationVerifier for quality verification
  • InferenceOptimizer for end-to-end optimization pipeline
  • QATTrainer for quantization-aware training

Model Export

  • GGUFExporter for llama.cpp/Ollama deployment
  • ONNXExporter for cross-platform inference
  • Support for Q4_K_M, Q8_0, F16 quantization types

CLI Commands

  • halo-forge inference optimize - Optimize for deployment
  • halo-forge inference export - Export to GGUF/ONNX
  • halo-forge inference benchmark - Measure latency

Calibration

  • CalibrationDataset for calibration data handling
  • Support for synthetic calibration data generation

Changed

  • Updated CLI reference with inference commands
  • Added inference section to website documentation

[0.3.0] - 2026-01-06

Added

Learning Rate Decay

  • --lr-decay flag for exponential LR decay across RAFT cycles (default: 0.85)
  • --min-lr flag to set learning rate floor (default: 1e-6)
  • Prevents training degradation at cycles 7-8

Execution Verifier

  • New ExecutionVerifier for test case-based verification
  • Supports multiple test cases with input/output pairs
  • Graduated rewards: 0.5 + 0.5 × pass_rate
  • Match modes: exact, contains, regex, numeric
  • Pre-configured variants: GCCExecutionVerifier, ClangExecutionVerifier, MinGWExecutionVerifier

Multi-Language Support

  • New MultiLanguageVerifier with auto-detection
  • Detects: C++, C, Python, Rust, Go, C#, PowerShell
  • Use --verifier auto for automatic language detection
  • AutoVerifier alias for CLI convenience

New Verifiers

  • RustVerifier with Windows cross-compilation support
  • GoVerifier with Windows cross-compilation support
  • DotNetVerifier for C# compilation to Windows PE
  • PowerShellVerifier for script syntax validation

Metrics Tracking

  • MetricsTracker with TensorBoard integration
  • JSON logging for all cycle metrics
  • TrainingMonitor for early stopping detection
  • Automatic metrics.jsonl generation

Dataset Loaders

  • HumanEvalPlusLoader - 80x more test cases per problem
  • LiveCodeBenchLoader - Contamination-free benchmark

CLI Enhancements

  • halo-forge config validate command
  • --system-prompt flag for custom prompts
  • MSVC verifier validation with helpful error messages

Changed

  • Default system prompt updated to “You are an expert Windows systems programmer”
  • Improved PEFT adapter handling to prevent stacking
  • Category tracking now supports root-level fields in datasets

Fixed

  • PEFT adapter stacking bug in _reload_model()
  • “Unknown” category issue in benchmark results
  • MSVC verifier parameter validation

[0.2.0] - 2025-01-01

Added

  • halo-forge test command for pipeline validation
    • --level smoke: Quick imports/compiler check (no GPU)
    • --level standard: Model loading, generation, verification
    • --level full: Complete mini-RAFT cycle with training
  • halo-forge benchmark full command for comprehensive benchmarks
  • Graduated rewards (RewardLevel) for partial credit
  • Runtime verification (run_after_compile) for compile verifiers
  • Comprehensive verifier unit tests
  • Chunked verification in RAFT trainer to prevent OOM

Changed

  • Optimized for BF16 (4-bit quantization removed from defaults)
  • Updated all docs to reflect 128GB unified memory
  • Improved error messages in verifiers
  • SFT trainer now uses device_map="auto"

Fixed

  • Memory leak during RAFT verification
  • Gradient checkpointing warning during benchmark training

[0.1.0] - 2024-12-28

Added

  • Initial release
  • Custom toolbox with ROCm 7 nightly for gfx1151
  • Data generation module (public datasets + LLM generation)
  • SFT training with LoRA/BF16 support
  • RAFT training with pluggable verifiers
  • Benchmarking with pass@k metrics
  • Built-in verifiers: GCC, Clang, MinGW, MSVC, pytest, unittest
  • CLI with subcommands
  • Documentation