Reward Models

Train a scorer from chosen/rejected examples

Reward-model training builds a scorer that estimates which of two answers to a prompt is better. It uses the same preference-pair data as DPO and ORPO: each example pairs a chosen answer with a rejected one.
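To make the chosen/rejected idea concrete, here is a minimal sketch of the pairwise (Bradley-Terry style) objective commonly used to train reward models. It is an assumption that halo-forge uses this exact loss, since this page does not say. The record field names (prompt, chosen, rejected) and both function names are hypothetical and chosen for illustration only.

```python
import math

# Hypothetical preference record; the field names are an assumption,
# not a documented halo-forge schema.
example_pair = {
    "prompt": "Explain what a reward model does.",
    "chosen": "It scores answers so that better ones rank higher.",
    "rejected": "It is a kind of database.",
}


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_chosen - score_rejected)).

    The loss is small when the chosen answer outscores the rejected one
    and grows as the ordering is violated.
    """
    margin = score_chosen - score_rejected
    # Numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preference_probability(score_chosen: float, score_rejected: float) -> float:
    """Modelled probability that the chosen answer is the better one."""
    # sigmoid(margin) equals exp(-loss), which avoids overflow.
    return math.exp(-pairwise_loss(score_chosen, score_rejected))


if __name__ == "__main__":
    # Equal scores mean the model has no preference between the answers.
    print(preference_probability(0.0, 0.0))
    # A correctly ordered pair is penalised less than a reversed one.
    print(pairwise_loss(2.0, 0.0) < pairwise_loss(0.0, 2.0))
```

In this formulation only the difference between the two scores matters, so the absolute scale of a reward model's output carries no meaning by itself.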

Dashboard

Open Train, choose Preferences, then choose Reward model. Use this when you need a reusable scorer for later ranking or RL workflows.
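As an illustration of the "later ranking" use case, the sketch below orders candidate answers by score and picks the best one. The names rank_candidates and toy_score are hypothetical, and toy_score is a stand-in so the example runs without a trained model. How a halo-forge reward model is actually loaded and called is not covered on this page, so that part is left as an assumption behind the score_fn parameter.

```python
from typing import Callable, List, Tuple


def rank_candidates(
    prompt: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[float, str]]:
    """Score every candidate answer and return them best-first."""
    scored = [(score_fn(prompt, answer), answer) for answer in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_score(prompt: str, answer: str) -> float:
    """Placeholder scorer: counts answer words that also occur in the prompt.

    A real workflow would call the trained reward model here instead.
    """
    prompt_words = set(prompt.lower().split())
    return float(sum(1 for word in answer.lower().split() if word in prompt_words))


if __name__ == "__main__":
    ranked = rank_candidates(
        "what is a reward model",
        ["no idea", "a reward model is a scorer"],
        toy_score,
    )
    best_score, best_answer = ranked[0]
    print(best_answer)
```

Keeping the scorer behind a plain callable means the same ranking code can be reused once a real model replaces the placeholder.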

CLI

halo-forge rm train --dataset ultrafeedback --model Qwen/Qwen2.5-3B-Instruct --output ~/.halo-forge/runs/rm-chat

Outputs

Reward-model runs write training_summary.json, logs, and a final model artifact when training completes. Results shows the local workstation path even when the browser cannot open the file directly.
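Because the summary is a JSON file on the workstation, it can be inspected with a few lines of Python when the browser cannot open it. This is a sketch under two assumptions: that training_summary.json sits directly inside the directory passed to --output, and that it contains a JSON object. The helper name load_training_summary is hypothetical, and no particular keys are assumed since this page does not list them.

```python
import json
from pathlib import Path
from typing import Any, Optional


def load_training_summary(run_dir: str) -> Optional[Any]:
    """Load training_summary.json from a run directory.

    Returns None when the file does not exist, for example when the run
    has not written it yet. The file location is an assumption.
    """
    path = Path(run_dir).expanduser() / "training_summary.json"
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    # Output directory taken from the CLI example on this page.
    summary = load_training_summary("~/.halo-forge/runs/rm-chat")
    if summary is None:
        print("No training_summary.json found in the run directory.")
    elif isinstance(summary, dict):
        # Print only the top-level keys; their names are not assumed here.
        print(sorted(summary))
    else:
        print(summary)
```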