Replay manifests
`halo-forge replay <run_dir>` regenerates the exact launch command for a captured run, optionally relaunching it. Every shipped trainer writes a `replay.json` manifest next to the `training_summary.json`, capturing every input that influenced the run.
halo-forge replay <run_dir> regenerates the exact launch command for a captured run, optionally relaunching it. Every shipped trainer writes a replay.json manifest next to the training_summary.json, capturing every input that influenced the run.
halo-forge replay ./models/dpo
# Reproducible launch command:
# halo-forge dpo train --model Qwen/Qwen2.5-3B --dataset ultrafeedback ...
halo-forge replay ./models/dpo --launch # relaunch in a subprocess
halo-forge replay ./models/dpo --launch --force # relaunch even on env drift
What the manifest captures
{
"manifest_version": 1,
"run_id": "dpo-1730000000000",
"modality": "dpo",
"timestamp": "2026-05-07T11:42:00+0000",
"model_name": "Qwen/Qwen2.5-3B-Instruct",
"seed": 42,
"pythonhashseed": "42",
"config": { /* full DPOConfig as dict */ },
"dataset": {
"kind": "huggingface",
"id": "HuggingFaceH4/ultrafeedback_binarized",
"revision": null
},
"environment": {
"python": "3.13.12",
"platform": "Darwin arm64",
"backend": "mlx",
"packages": {
"halo_forge": "1.4.0",
"torch": "2.5.0",
"mlx": "0.31.2",
"mlx_lm": "0.31.3",
"transformers": "4.49.0",
"peft": "0.14.0",
"trl": "0.29.1",
...
}
},
"cli_args": ["dpo", "train", "--model", "Qwen/Qwen2.5-3B-Instruct", ...]
}
Captured
- Run identity:
run_id,modality,timestamp,model_name. - Seed bundle: training seed +
PYTHONHASHSEED+ accelerator seed. Re-applied at replay time viaset_global_seed. - Dataset identity:
- Local files → SHA-256 + size, streamed so multi-GB JSONLs don’t OOM.
- HuggingFace → id + revision, so a renamed/updated dataset is detected.
- Full config snapshot (dataclass → dict).
- Environment fingerprint: Python version, platform, active backend, and pinned versions of
halo_forge,torch,mlx,mlx_lm,transformers,peft,trl,accelerate,datasets,bitsandbytes,vllm,numpy. - Literal argv that produced the run.
Deliberately not captured
- Wall-clock timing (non-deterministic across hosts).
- GPU compute precision (handled by
torch.use_deterministic_algorithmsat replay time). - Full dataset contents — we hash instead. A divergent dataset surfaces as a hash mismatch at replay.
Environment diff
halo-forge replay <run_dir> automatically diffs the captured environment fingerprint against the active host. Sample output:
Environment differs from the captured run:
python: '3.13.12' -> '3.12.5'
platform: 'Darwin arm64' -> 'Linux x86_64'
backend: 'mlx' -> 'cuda'
packages.torch: '2.5.0' -> '2.4.0'
packages.trl: '0.29.1' -> '0.31.0'
Drift doesn’t refuse the replay (sometimes re-running on a different host is the point — comparing reproducibility across MLX vs CUDA, or validating that a torch upgrade didn’t change behavior). It does loud-warn so you have the information to interpret divergence.
--launch --force skips the env-match gate.
Why dataset hashing matters
The classic reproducibility failure: a training run completes, six months pass, the dataset gets quietly rebuilt with new examples, and the “replay” silently trains on different data. The hash catches this:
- Local-file datasets: SHA-256 over the file. Change one byte → hash differs.
- HF datasets: id + revision (when set). Newer revision → mismatch.
A dataset.sha256 mismatch at replay is the single most informative signal for “this isn’t actually a replay.”
Programmatic API
from halo_forge.replay import (
capture_manifest, save_manifest, load_manifest,
compare_environments, EnvironmentFingerprint,
)
manifest = capture_manifest(
run_id=run.run_id,
modality="dpo",
model_name=cfg.model_name,
seed=cfg.seed,
config=cfg,
dataset_id="HuggingFaceH4/ultrafeedback_binarized",
cli_args=sys.argv[1:],
)
save_manifest(manifest, output_dir)
# At replay time:
loaded = load_manifest(output_dir)
diff = compare_environments(loaded.environment, EnvironmentFingerprint.capture().to_dict())
if not diff["matched"]:
print("Environment drift:", diff["differences"])
Manifest versioning
MANIFEST_VERSION = 1. A future-version manifest still loads; the loader warns and serves the seed + config (the load-bearing fields for replay) without refusing. Older clients can replay newer-format manifests as long as the field names they care about are stable.
Storage
Manifests live next to training_summary.json:
models/dpo/
├── adapter_config.json
├── adapters.safetensors
├── final_model/
├── replay.json ← here
└── training_summary.json
About 5-10 KB per manifest. The dataset SHA-256 is the dominant fixed cost; the env fingerprint is small.
What replay catches that other approaches miss
- Dataset swap — caught by the hash mismatch.
- Code drift — the
halo_forgepackage version is captured; a non-trivial version bump surfaces as env diff. - Backend-detection drift —
environment.backendcaptures what halo-forge actually selected; useful for debugging “it ran on MLX last time, why is it picking MPS now?” - Seed regression — random seeds are explicit in the manifest, not implicit in
--seed 42getting silently dropped.
Roadmap
- Compute-precision capture — record
torch.use_deterministic_algorithmsstate, cudnn determinism flags, MLX deterministic mode if/when MLX exposes one. - Run forking (F-Q) — replay manifests will be the foundation; “fork from cycle 3 with one knob changed” needs a place to declare what was forked from what.
- Cross-machine replay verification — score divergence across hosts; surface “your reproduction differs from the captured run’s checkpoint by X%”.