Verifiers
Pluggable verification system for RLVR training. Plugin registry, programmatic + schema + reference-metric + LLM-as-judge verifiers.
Verifiers are the heart of RLVR: they provide the reward signal that guides training.
Important: Verifiers are training infrastructure, not benchmarks. For benchmark reporting (comparing to papers), see Evaluation.
Plugin registry
Halo-forge ships a plugin registry for verifiers. Three coexisting registration paths funnel into one dict:
# 1. Decorator (programmatic)
from halo_forge.rlvr.verifiers import register_verifier, Verifier, VerifyResult
@register_verifier("my_check")
class MyVerifier(Verifier):
    def verify(self, code: str) -> VerifyResult:
        ...
# 2. Plugin directory
# ~/.halo-forge/verifiers/my_check.py
# Auto-discovered on first registry access. Override path with
# HALOFORGE_VERIFIERS_DIR.
# 3. Entry-point packages (pip-installable plugins)
# [project.entry-points."halo_forge.verifiers"]
# my_check = "my_pkg.verifiers:MyVerifier"
The trainer / CLI / public-API consumers call get_verifier(name) instead of importing concrete classes:
from halo_forge.rlvr.verifiers import get_verifier
verifier_cls = get_verifier("my_check")
verifier = verifier_cls(...)
Use list_registered_verifiers() to see every name in the registry.
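The idea of several registration paths funnelling into one dict can be sketched in a few lines. This is an illustrative stand-in, not halo-forge's actual implementation: the `_REGISTRY` name, the duplicate-name check, and the error messages are assumptions, and only the decorator path is shown.

```python
from typing import Callable, Dict, List

# Hypothetical module-level registry: short name -> verifier class.
_REGISTRY: Dict[str, type] = {}


def register_verifier(name: str) -> Callable[[type], type]:
    """Class decorator that stores the decorated class under `name`."""
    def decorator(cls: type) -> type:
        if name in _REGISTRY:
            # Assumption: duplicate names are rejected rather than overwritten.
            raise ValueError(f"verifier {name!r} is already registered")
        _REGISTRY[name] = cls
        return cls  # hand the class back unchanged
    return decorator


def get_verifier(name: str) -> type:
    """Look up a verifier class by its short name."""
    try:
        return _REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(_REGISTRY)) or "<none>"
        raise KeyError(f"unknown verifier {name!r}; registered: {known}") from None


def list_registered_verifiers() -> List[str]:
    """Return every registered name, sorted for stable output."""
    return sorted(_REGISTRY)


@register_verifier("my_check")
class MyVerifier:
    def verify(self, code: str) -> bool:
        return bool(code.strip())


verifier = get_verifier("my_check")()
print(list_registered_verifiers())  # ['my_check']
print(verifier.verify("int main(void) { return 0; }"))  # True
```

A plugin directory or an entry-point loader would only need to import the plugin module; the decorator then runs as a side effect of the import and fills the same dict.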
Built-in short names
Pass any of these as --verifier <name> to GRPO / RAFT / data synthesize:
| Category | Names |
|---|---|
| Code execution + compile | gcc, clang, mingw, execution, gcc_execution, mingw_execution, clang_execution, pytest, unittest, rlvr_pytest, humaneval, mbpp, rust, cargo, go, custom, subprocess |
| Schema + format | json_structure, json_schema, regex_format |
| Reference metrics | bleu, rouge, chrf |
| LLM-as-judge | llm_judge |
Verifiers vs Benchmarks
Verifiers and benchmarks serve different purposes:
| Aspect | Verifiers (Training) | Benchmarks (Reporting) |
|---|---|---|
| Purpose | Provide reward signal for RAFT | Compare model to published results |
| Output | Graduated rewards (0.0 to 1.0) | Metrics (pass@k, accuracy, WER) |
| When Used | During training loop | After training is complete |
| Tooling | Native halo-forge verifiers | Community tools (VLMEvalKit, etc.) |
Use verifiers when: running RAFT training, when you need graduated feedback, or when debugging a training run.
Use benchmarks when: evaluating a trained model, comparing to papers, or publishing results.
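To make the "Output" row concrete: a verifier emits one graduated reward per sample, while a benchmark collapses pass/fail counts into a metric such as pass@k. The sketch below contrasts the two using the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k). The reward values are made up for illustration and the function name is not part of halo-forge.

```python
from math import comb
from statistics import mean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    from n generations (c of them correct) passes."""
    if n - c < k:
        return 1.0  # fewer than k failures, so every draw contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)


# Made-up rewards for 10 generations of a single prompt.
rewards = [1.0, 0.7, 0.5, 0.5, 0.3, 0.0, 1.0, 0.7, 0.0, 0.5]

training_signal = mean(rewards)               # graduated, used inside the loop
num_correct = sum(r == 1.0 for r in rewards)  # only full passes count for pass@k

print(f"mean reward: {training_signal:.2f}")                           # 0.52
print(f"pass@1: {pass_at_k(len(rewards), num_correct, 1):.2f}")        # 0.20
print(f"pass@5: {pass_at_k(len(rewards), num_correct, 5):.2f}")        # 0.78
```

The partial credit (0.3, 0.5, 0.7) shapes the training signal but is invisible to pass@k, which is why the two are not interchangeable.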
Built-in Verifiers
Compilation Verifiers
| Verifier | Language | Target | Compile | Run | Cross-Compile |
|---|---|---|---|---|---|
| GCCVerifier | C/C++ | Linux ELF | Yes | Yes | - |
| ClangVerifier | C/C++ | Linux ELF | Yes | Yes | - |
| MinGWVerifier | C/C++ | Windows PE | Yes | No | - |
| RemoteMSVCVerifier | C/C++ | Windows PE | Yes | Yes | Requires Windows server |
| RustVerifier | Rust | Native/Windows | Yes | Yes | x86_64-pc-windows-gnu |
| GoVerifier | Go | Native/Windows | Yes | Yes | GOOS=windows |
| DotNetVerifier | C# | Windows PE | Yes | No | win-x64 |
| PowerShellVerifier | PowerShell | Script | Syntax | No | - |
Test Verifiers
| Verifier | Language | Use Case |
|---|---|---|
| PytestVerifier | Python | Code with tests |
| UnittestVerifier | Python | unittest format |
| HumanEvalVerifier | Python | HumanEval benchmark |
| MBPPVerifier | Python | MBPP benchmark |
| SubprocessVerifier | Any | Custom commands |
Basic Usage
from halo_forge.rlvr.verifiers import GCCVerifier
verifier = GCCVerifier()
result = verifier.verify(code)
print(result.success) # True/False
print(result.reward) # 0.0 - 1.0
print(result.details) # Human-readable message
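For readers who want to see the shape of that result object without installing a compiler, here is a self-contained sketch. `SketchResult` and `PythonSyntaxVerifier` are hypothetical stand-ins that only mirror the three fields printed above; they are not halo-forge classes, and the 0.5 reward is borrowed from the "Compiles clean" level in this document's reward table.

```python
from dataclasses import dataclass


@dataclass
class SketchResult:
    """Hypothetical stand-in mirroring success / reward / details."""
    success: bool
    reward: float
    details: str


class PythonSyntaxVerifier:
    """Toy verifier: checks that a string parses as Python source."""

    def verify(self, code: str) -> SketchResult:
        try:
            # compile() parses the source without executing it.
            compile(code, "<candidate>", "exec")
        except SyntaxError as exc:
            return SketchResult(False, 0.0, f"syntax error on line {exc.lineno}: {exc.msg}")
        return SketchResult(True, 0.5, "parses cleanly")


verifier = PythonSyntaxVerifier()
good = verifier.verify("def add(a, b):\n    return a + b\n")
bad = verifier.verify("def add(a, b)\n    return a + b\n")  # missing colon

print(good.success, good.reward)  # True 0.5
print(bad.success, bad.reward)    # False 0.0
```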
Graduated Rewards
Binary pass/fail rewards create sparse gradients, because a near miss scores the same as garbage. halo-forge uses graduated rewards instead:
| Outcome | Reward | Signal |
|---|---|---|
| Syntax error | 0.0 | Completely wrong |
| Compiles with warnings | 0.3 | Close but imperfect |
| Compiles clean | 0.5 | Correct syntax |
| Runs without crash | 0.7 | Executable |
| Correct output | 1.0 | Fully correct |
from halo_forge.rlvr.verifiers import RewardLevel
# Get reward from compile result
reward = RewardLevel.from_compile_result(success=True, has_warnings=False)
# Returns 0.5
# Get reward from execution result
reward = RewardLevel.from_execution_result(
    compiles=True,
    runs=True,
    correct=False
)
# Returns 0.7
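The ladder in the table can also be written as a small pure function. This is a sketch of the mapping only: `graduated_reward` is a hypothetical name, and how `RewardLevel` is implemented internally is not shown in this document. One assumption is flagged in the code: a program that compiles with warnings but still runs is scored by its execution outcome, which the table does not spell out.

```python
def graduated_reward(
    compiles: bool,
    has_warnings: bool = False,
    runs: bool = False,
    correct: bool = False,
) -> float:
    """Map verification outcomes onto the reward ladder from the table."""
    if not compiles:
        return 0.0  # syntax error
    if runs and correct:
        return 1.0  # correct output
    if runs:
        # Assumption: execution outcome outranks compiler warnings.
        return 0.7  # runs without crash
    if has_warnings:
        return 0.3  # compiles with warnings
    return 0.5      # compiles clean


outcomes = [
    dict(compiles=False),
    dict(compiles=True, has_warnings=True),
    dict(compiles=True),
    dict(compiles=True, runs=True),
    dict(compiles=True, runs=True, correct=True),
]
for outcome in outcomes:
    print(outcome, "->", graduated_reward(**outcome))
```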
Batch Verification
Verify multiple samples in parallel:
verifier = GCCVerifier(max_workers=8)
codes = [code1, code2, code3, ...]
results = verifier.verify_batch(codes) # Parallel execution
for result in results:
    print(f"{result.reward}: {result.details}")
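One plausible way to get this kind of parallelism is a thread pool, which suits verifiers that spend most of their time waiting on compiler subprocesses. The sketch below is an assumption about the approach, not halo-forge's implementation; the free function `verify_batch` and the `toy_verify` stand-in are illustrative names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

R = TypeVar("R")


def verify_batch(
    verify: Callable[[str], R],
    codes: Sequence[str],
    max_workers: int = 8,
) -> List[R]:
    """Run `verify` on every sample in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of `codes`,
        # regardless of which worker finishes first.
        return list(pool.map(verify, codes))


def toy_verify(code: str) -> float:
    """Stand-in verifier: 0.5 if the text parses as Python, else 0.0."""
    try:
        compile(code, "<candidate>", "exec")
    except SyntaxError:
        return 0.0
    return 0.5


rewards = verify_batch(toy_verify, ["x = 1", "x = = 1", "print('hi')"])
print(rewards)  # [0.5, 0.0, 0.5]
```

Keeping results in input order matters for training: each reward has to be matched back to the prompt and completion that produced it.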
With RAFT Training
from halo_forge.rlvr import RAFTTrainer
from halo_forge.rlvr.verifiers import GCCVerifier
verifier = GCCVerifier(max_workers=8)
trainer = RAFTTrainer(
    verifier=verifier,
    sft_checkpoint="models/sft/final_model"
)
trainer.run(prompts, num_cycles=5)
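RAFT-style training generally scores several candidate completions and fine-tunes only on the best ones, which is where the verifier's reward comes in. The selection step can be sketched as below. The threshold, the kept fraction, and the function name are hypothetical and are not taken from `RAFTTrainer`; check the trainer's own configuration for the real knobs.

```python
from typing import List, Tuple


def select_for_sft(
    scored: List[Tuple[str, float]],
    reward_threshold: float = 0.5,
    keep_top_fraction: float = 0.5,
) -> List[str]:
    """Keep the best-scoring completions for the next fine-tuning round."""
    # Drop everything below the threshold first.
    passing = [(text, reward) for text, reward in scored if reward >= reward_threshold]
    # Rank by reward, highest first; the sort is stable, so ties keep input order.
    passing.sort(key=lambda pair: pair[1], reverse=True)
    # Keep at least one sample whenever anything passed the threshold.
    keep = max(1, int(len(passing) * keep_top_fraction)) if passing else 0
    return [text for text, _ in passing[:keep]]


scored = [("a", 1.0), ("b", 0.3), ("c", 0.7), ("d", 0.5), ("e", 0.0), ("f", 0.5)]
print(select_for_sft(scored))  # ['a', 'c']
```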
Verifier Architecture
                               Verifier (base class)
                                         │
        ┌────────────────────────────────┼────────────────────┐
        │                                │                    │
 CompileVerifier                   TestVerifier        CustomVerifier
        │                                │
 ┌──────┼──────┬──────┬──────┐     ┌─────┴─────┐
 │      │      │      │      │     │           │
GCC   MinGW  Clang   Rust    Go Pytest     Unittest
 │
RemoteMSVC
DotNet
PowerShell
Chaining Verifiers
Run multiple verification stages:
from halo_forge.rlvr.verifiers import ChainedVerifier, GCCVerifier
verifier = ChainedVerifier([
    GCCVerifier(),                        # Stage 1: Compile
    GCCVerifier(run_after_compile=True),  # Stage 2: Run
])
result = verifier.verify(code)
# Stops at first failure, accumulates rewards
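The stop-at-first-failure behaviour can be illustrated with a small stand-alone chain. Everything here is a hypothetical stand-in (`StageResult`, `run_chain`, and the two toy stages), and one detail is an explicit assumption: "accumulates rewards" is interpreted as keeping the highest reward reached before the chain stops. `ChainedVerifier` may combine stage rewards differently.

```python
import ast
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StageResult:
    success: bool
    reward: float
    details: str


Stage = Callable[[str], StageResult]


def run_chain(stages: List[Stage], code: str) -> StageResult:
    """Run stages in order and stop at the first failure."""
    best = 0.0
    notes: List[str] = []
    for stage in stages:
        result = stage(code)
        best = max(best, result.reward)  # assumption: keep the highest reward reached
        notes.append(result.details)
        if not result.success:
            return StageResult(False, best, " -> ".join(notes))
    return StageResult(True, best, " -> ".join(notes))


def parses(code: str) -> StageResult:
    """Toy stage 1: does the text parse as Python?"""
    try:
        ast.parse(code)
    except SyntaxError:
        return StageResult(False, 0.0, "syntax error")
    return StageResult(True, 0.5, "parses")


def defines_main(code: str) -> StageResult:
    """Toy stage 2: is there a function called main? Only runs if stage 1 passed."""
    found = any(
        isinstance(node, ast.FunctionDef) and node.name == "main"
        for node in ast.walk(ast.parse(code))
    )
    return StageResult(found, 0.7 if found else 0.5, "defines main()" if found else "no main()")


for sample in ["def main():\n    pass\n", "x = 1", "def main("]:
    print(run_chain([parses, defines_main], sample))
```

The third sample never reaches `defines_main`, so later, more expensive stages are skipped for code that already failed an earlier one.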
Cleanup
Always clean up resources:
verifier = GCCVerifier()
try:
    results = verifier.verify_batch(codes)
finally:
    verifier.cleanup()
# Or use context manager
with GCCVerifier() as verifier:
    results = verifier.verify_batch(codes)
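Both forms above rely on the same pattern: the context manager simply calls `cleanup()` on exit. A minimal sketch of that pattern follows. What `cleanup()` actually releases in halo-forge is not specified here, so the example assumes a scratch directory for build artifacts; `ScratchVerifier` is a hypothetical class.

```python
import os
import shutil
import tempfile


class ScratchVerifier:
    """Hypothetical verifier that owns a temporary build directory."""

    def __init__(self) -> None:
        self.workdir = tempfile.mkdtemp(prefix="verifier-")

    def cleanup(self) -> None:
        # ignore_errors makes cleanup safe to call more than once.
        shutil.rmtree(self.workdir, ignore_errors=True)

    def __enter__(self) -> "ScratchVerifier":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.cleanup()
        return False  # do not swallow exceptions raised in the with-body


with ScratchVerifier() as verifier:
    workdir = verifier.workdir
    existed_inside = os.path.isdir(workdir)

print(existed_inside)          # True while the verifier is open
print(os.path.isdir(workdir))  # False after the block exits
```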
Verifier Safety
Verifiers execute model-generated code. Treat verifier runs as untrusted execution.
Current Safety Characteristics
| Verifier Type | Safety Notes |
|---|---|
| SubprocessVerifier | Uses argv execution with shell=False |
| Python verifiers | Execute without OS-level resource limits |
| C/C++ verifiers | Apply setrlimit for CPU and memory during execution |
Operational Guardrails (Recommended)
- Isolate execution: Run verifiers in VMs or containers, avoid host mounts
- Non-privileged user: Do not run verifiers as root
- Constrain resources: Use OS limits or container quotas where possible
- Restrict network: Prefer offline execution or blocked egress
These are operational recommendations, not enforced by halo-forge itself.
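As an illustration of the "constrain resources" guardrail, the sketch below runs a command with `shell=False`, a wall-clock timeout, and `setrlimit` caps on CPU time and address space. It is POSIX-only, the helper names and default limits are hypothetical, and it is not how halo-forge's C/C++ verifiers are implemented internally. It also does not replace container or VM isolation.

```python
import resource  # POSIX only
import subprocess
import sys
from typing import List


def _cap(kind: int, limit: int) -> None:
    """Lower one resource limit without exceeding the current hard limit."""
    _, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        limit = min(limit, hard)
    # Setting soft and hard together stops the child from raising it again.
    resource.setrlimit(kind, (limit, limit))


def run_limited(
    argv: List[str],
    cpu_seconds: int = 5,
    memory_bytes: int = 1 << 30,
    wall_seconds: float = 10.0,
) -> subprocess.CompletedProcess:
    """Run argv without a shell, with CPU, memory, and wall-clock limits."""

    def apply_limits() -> None:
        # Runs in the child process between fork() and exec().
        _cap(resource.RLIMIT_CPU, cpu_seconds)
        _cap(resource.RLIMIT_AS, memory_bytes)

    return subprocess.run(
        argv,
        shell=False,            # argv is passed as-is, never parsed by a shell
        capture_output=True,
        text=True,
        timeout=wall_seconds,   # raises subprocess.TimeoutExpired on overrun
        preexec_fn=apply_limits,
    )


result = run_limited([sys.executable, "-c", "print('ok')"])
print(result.returncode, result.stdout.strip())
```

Note that `preexec_fn` is documented as unsafe when the parent process uses threads, so a threaded batch runner would need a different mechanism, such as a small wrapper executable or container quotas.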
- Compile Verifiers: Multi-language compilation verification
- Custom Verifiers: Create your own verification logic
- Test Verifiers: pytest and unittest verification
- Execution Verifier
- Multi-Language Verifier
- Schema verifiers: JSON-structure / JSON-schema / regex-format verifiers for tool-calling, structured output, and format-discipline finetunes.
- Reference-metric verifiers: BLEU / ROUGE / chrF. Score candidates against a reference string with the standard MT / summarization metrics.
- LLM-as-judge verifier: Rubric-graded verifier for outputs without a programmatic ground truth. Pluggable judge model, local or hosted.
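As an illustration of what a schema-style verifier might compute for tool-calling output, here is a minimal sketch of a graduated JSON-structure check. The function name and the specific reward levels are hypothetical and are not the scoring used by the `json_structure` verifier.

```python
import json
from typing import Iterable


def json_structure_reward(output: str, required_keys: Iterable[str]) -> float:
    """Graduated reward for a JSON object with required top-level keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0  # not JSON at all
    if not isinstance(parsed, dict):
        return 0.3  # valid JSON, but not an object
    required = list(required_keys)
    if not required:
        return 1.0
    present = sum(key in parsed for key in required)
    # Scale from 0.5 (object, none of the keys) to 1.0 (all keys present).
    return 0.5 + 0.5 * present / len(required)


keys = ["name", "arguments"]
print(json_structure_reward('{"name": "get_weather", "arguments": {}}', keys))  # 1.0
print(json_structure_reward('{"name": "get_weather"}', keys))                   # 0.75
print(json_structure_reward('[1, 2]', keys))                                    # 0.3
print(json_structure_reward('not json', keys))                                  # 0.0
```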