Verifiers

Pluggable verification system for RLVR training: a plugin registry plus programmatic, schema, reference-metric, and LLM-as-judge verifiers.

Verifiers are the heart of RLVR — they provide the reward signal that guides training.

Important: Verifiers are training infrastructure, not benchmarks. For benchmark reporting (comparing to papers), see Evaluation.

Plugin registry

Halo-forge ships a plugin registry for verifiers. Three coexisting registration paths funnel into one dict:

# 1. Decorator (programmatic)
from halo_forge.rlvr.verifiers import register_verifier, Verifier, VerifyResult

@register_verifier("my_check")
class MyVerifier(Verifier):
    def verify(self, code: str) -> VerifyResult:
        ...

# 2. Plugin directory
#    ~/.halo-forge/verifiers/my_check.py
#    Auto-discovered on first registry access. Override path with
#    HALOFORGE_VERIFIERS_DIR.

# 3. Entry-point packages (pip-installable plugins)
#    [project.entry-points."halo_forge.verifiers"]
#    my_check = "my_pkg.verifiers:MyVerifier"

The trainer / CLI / public-API consumers call get_verifier(name) instead of importing concrete classes:

from halo_forge.rlvr.verifiers import get_verifier
verifier_cls = get_verifier("my_check")
verifier = verifier_cls(...)

Use list_registered_verifiers() to see every name in the registry.
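The decorator path can be pictured as a dict keyed by name. The following is a minimal standalone sketch of that pattern, not halo-forge's actual internals: `_REGISTRY`, `register`, `get`, `list_registered`, and `AlwaysPass` are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the single dict that all registration paths feed.
_REGISTRY: Dict[str, type] = {}


def register(name: str) -> Callable[[type], type]:
    """Return a class decorator that stores the class under `name`."""
    def decorator(cls: type) -> type:
        if name in _REGISTRY:
            raise ValueError(f"verifier {name!r} is already registered")
        _REGISTRY[name] = cls
        return cls  # the class itself is returned unchanged
    return decorator


def get(name: str) -> type:
    """Look up a registered class; list the known names if the lookup fails."""
    try:
        return _REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(_REGISTRY)) or "<none>"
        raise KeyError(f"unknown verifier {name!r}; registered: {known}") from None


def list_registered() -> List[str]:
    """Return every registered name in sorted order."""
    return sorted(_REGISTRY)


@register("always_pass")
class AlwaysPass:
    def verify(self, code: str) -> bool:
        return True


verifier = get("always_pass")()  # look up the class by name, then instantiate
print(list_registered())
print(verifier.verify("int main(void) { return 0; }"))
```

Looking classes up by name keeps callers decoupled from concrete imports, which is presumably why the trainer and CLI go through `get_verifier(name)`.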

Built-in short names

Pass any of these as --verifier <name> to GRPO / RAFT / data synthesize:

Category	Names
Code execution + compile	`gcc`, `clang`, `mingw`, `execution`, `gcc_execution`, `mingw_execution`, `clang_execution`, `pytest`, `unittest`, `rlvr_pytest`, `humaneval`, `mbpp`, `rust`, `cargo`, `go`, `custom`, `subprocess`
Schema + format	`json_structure`, `json_schema`, `regex_format`
Reference metrics	`bleu`, `rouge`, `chrf`
LLM-as-judge	`llm_judge`

Verifiers vs Benchmarks

Verifiers and benchmarks serve different purposes:

Aspect	Verifiers (Training)	Benchmarks (Reporting)
Purpose	Provide reward signal for RAFT	Compare model to published results
Output	Graduated rewards (0.0 to 1.0)	Metrics (pass@k, accuracy, WER)
When Used	During training loop	After training is complete
Tooling	Native halo-forge verifiers	Community tools (VLMEvalKit, etc.)

Use verifiers when: running RAFT training, needing graduated feedback, or debugging training.

Use benchmarks when: evaluating a trained model, comparing to papers, or publishing results.

Built-in Verifiers

Compilation Verifiers

Verifier	Language	Target	Compile	Run	Cross-Compile
`GCCVerifier`	C/C++	Linux ELF	Yes	Yes	-
`ClangVerifier`	C/C++	Linux ELF	Yes	Yes	-
`MinGWVerifier`	C/C++	Windows PE	Yes	No	-
`RemoteMSVCVerifier`	C/C++	Windows PE	Yes	Yes	Requires Windows server
`RustVerifier`	Rust	Native/Windows	Yes	Yes	x86_64-pc-windows-gnu
`GoVerifier`	Go	Native/Windows	Yes	Yes	GOOS=windows
`DotNetVerifier`	C#	Windows PE	Yes	No	win-x64
`PowerShellVerifier`	PowerShell	Script	Syntax	No	-

Test Verifiers

Verifier	Language	Use Case
`PytestVerifier`	Python	Code with tests
`UnittestVerifier`	Python	unittest format
`HumanEvalVerifier`	Python	HumanEval benchmark
`MBPPVerifier`	Python	MBPP benchmark
`SubprocessVerifier`	Any	Custom commands

Basic Usage

from halo_forge.rlvr.verifiers import GCCVerifier

verifier = GCCVerifier()
result = verifier.verify(code)

print(result.success)   # True/False
print(result.reward)    # 0.0 - 1.0
print(result.details)   # Human-readable message

Graduated Rewards

Binary rewards create sparse gradients, so halo-forge uses graduated rewards:

Outcome	Reward	Signal
Syntax error	0.0	Completely wrong
Compiles with warnings	0.3	Close but imperfect
Compiles clean	0.5	Correct syntax
Runs without crash	0.7	Executable
Correct output	1.0	Fully correct

from halo_forge.rlvr.verifiers import RewardLevel

# Get reward from compile result
reward = RewardLevel.from_compile_result(success=True, has_warnings=False)
# Returns 0.5

# Get reward from execution result
reward = RewardLevel.from_execution_result(
    compiles=True,
    runs=True,
    correct=False
)
# Returns 0.7
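The reward table can be read as a ladder in which each later stage overrides the earlier ones. The sketch below encodes that reading in a hypothetical standalone function, `graduated_reward`; it is not the `RewardLevel` implementation. How warnings interact with a successful run is an assumption here: the run result wins.

```python
def graduated_reward(
    compiles: bool,
    has_warnings: bool = False,
    runs: bool = False,
    correct: bool = False,
) -> float:
    """Map verification outcomes to the reward values in the table above."""
    if not compiles:
        return 0.0  # syntax error
    if runs and correct:
        return 1.0  # correct output
    if runs:
        return 0.7  # runs without crashing (assumed to outrank warnings)
    if has_warnings:
        return 0.3  # compiles with warnings
    return 0.5      # compiles clean


print(graduated_reward(compiles=True, has_warnings=True))
print(graduated_reward(compiles=True, runs=True, correct=False))
```

Checking the most advanced stage first keeps each rung mutually exclusive, so a sample always receives exactly one of the five values.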

Batch Verification

Verify multiple samples in parallel:

verifier = GCCVerifier(max_workers=8)
codes = [code1, code2, code3, ...]

results = verifier.verify_batch(codes)  # Parallel execution

for result in results:
    print(f"{result.reward}: {result.details}")
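One way to get this kind of parallelism is a thread pool that maps a verify function over the inputs. The sketch below is a standalone illustration under that assumption; `verify_batch` and `toy_verify` here are hypothetical helpers, and halo-forge's own `verify_batch` may be implemented differently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def verify_batch(
    verify: Callable[[str], float],
    codes: Sequence[str],
    max_workers: int = 8,
) -> List[float]:
    """Run `verify` on every sample concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of `codes`,
        # regardless of which worker finishes first.
        return list(pool.map(verify, codes))


def toy_verify(code: str) -> float:
    """Stand-in verifier: pretend anything that mentions `main` compiles."""
    return 0.5 if "main" in code else 0.0


rewards = verify_batch(toy_verify, ["int main(void) { return 0; }", "not code"])
print(rewards)
```

Threads suit this workload when each verification spends its time waiting on a compiler or test subprocess rather than running Python code.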

With RAFT Training

from halo_forge.rlvr import RAFTTrainer
from halo_forge.rlvr.verifiers import GCCVerifier

verifier = GCCVerifier(max_workers=8)

trainer = RAFTTrainer(
    verifier=verifier,
    sft_checkpoint="models/sft/final_model"
)

trainer.run(prompts, num_cycles=5)

Verifier Architecture

                         Verifier (base class)
                                │
        ┌───────────────────────┼───────────────────────┐
        │                       │                       │
  CompileVerifier          TestVerifier           CustomVerifier
        │                       │
   ┌────┼────┬────┬────┐   ┌────┴────┐
   │    │    │    │    │   │         │
  GCC MinGW Clang Rust Go Pytest  Unittest
              │
         RemoteMSVC
         DotNet
         PowerShell

Chaining Verifiers

Run multiple verification stages:

from halo_forge.rlvr.verifiers import ChainedVerifier, GCCVerifier

verifier = ChainedVerifier([
    GCCVerifier(),                        # Stage 1: Compile
    GCCVerifier(run_after_compile=True),  # Stage 2: Run
])

result = verifier.verify(code)
# Stops at first failure, accumulates rewards
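The stop-at-first-failure behavior can be sketched as a loop over stage functions. This is a standalone illustration, not `ChainedVerifier` itself: `StageResult`, `run_chain`, and the toy stages are hypothetical, and "accumulates rewards" is interpreted here as keeping the highest reward reached so far, which is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StageResult:
    success: bool
    reward: float
    details: str


Stage = Callable[[str], StageResult]


def run_chain(stages: List[Stage], code: str) -> StageResult:
    """Run stages in order; stop at the first failure, keep the best reward."""
    best_reward = 0.0
    details = "no stages ran"
    for stage in stages:
        result = stage(code)
        best_reward = max(best_reward, result.reward)
        details = result.details
        if not result.success:
            # Later stages are skipped, but earlier partial credit is kept.
            return StageResult(False, best_reward, details)
    return StageResult(bool(stages), best_reward, details)


def compiles_stage(code: str) -> StageResult:
    ok = "main" in code  # toy check standing in for a real compile
    return StageResult(ok, 0.5 if ok else 0.0, "compiled" if ok else "compile error")


def runs_stage(code: str) -> StageResult:
    ok = "return 0" in code  # toy check standing in for a real run
    return StageResult(ok, 0.7 if ok else 0.0, "ran" if ok else "crashed")


print(run_chain([compiles_stage, runs_stage], "int main(void) { return 1; }"))
```

In this sketch a sample that compiles but fails at run time still earns the compile-stage reward, which preserves the graduated signal across stages.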

Cleanup

Always clean up resources:

verifier = GCCVerifier()

try:
    results = verifier.verify_batch(codes)
finally:
    verifier.cleanup()

# Or use a context manager
with GCCVerifier() as verifier:
    results = verifier.verify_batch(codes)
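The two forms are equivalent when `__exit__` delegates to `cleanup()`. The sketch below shows that protocol on a hypothetical `ScratchVerifier` class that owns a temporary directory; which resources halo-forge verifiers actually hold is not specified here.

```python
import os
import shutil
import tempfile


class ScratchVerifier:
    """Hypothetical verifier that owns a temporary work directory."""

    def __init__(self) -> None:
        self.workdir = tempfile.mkdtemp(prefix="verifier_")

    def cleanup(self) -> None:
        # ignore_errors makes repeated calls harmless.
        shutil.rmtree(self.workdir, ignore_errors=True)

    def __enter__(self) -> "ScratchVerifier":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.cleanup()
        return False  # never swallow exceptions raised inside the block


with ScratchVerifier() as v:
    path = v.workdir
    assert os.path.isdir(path)  # the directory exists while in use

print(os.path.exists(path))  # False: removed on exit
```

Returning `False` from `__exit__` means a failure inside the `with` block still propagates after the directory has been removed.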

Verifier Safety

Verifiers execute model-generated code. Treat verifier runs as untrusted execution.

Current Safety Characteristics

Verifier Type	Safety Notes
`SubprocessVerifier`	Uses argv execution with `shell=False`
Python verifiers	Execute without OS-level resource limits
C/C++ verifiers	Apply `setrlimit` for CPU and memory during execution

Operational Guardrails (Recommended)

Isolate execution: Run verifiers in VMs or containers and avoid host mounts
Non-privileged user: Do not run verifiers as root
Constrain resources: Use OS limits or container quotas where possible
Restrict network: Prefer offline execution or block egress

These are operational recommendations; halo-forge does not enforce them.
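For the "constrain resources" guardrail on POSIX systems, one option is to combine argv-style execution (`shell=False`) with `setrlimit` in the child process and a wall-clock timeout. The sketch below is a standalone illustration using only the standard library; `run_limited`, `_cap`, and the specific limits are hypothetical choices, and using `RLIMIT_AS` as the memory cap is an assumption rather than a description of what halo-forge's C/C++ verifiers do.

```python
import resource
import subprocess
import sys
from typing import List


def _cap(kind: int, limit: int) -> None:
    """Lower one rlimit without exceeding the current hard limit."""
    _, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        limit = min(limit, hard)
    # Lowering the hard limit too means the child cannot raise it again.
    resource.setrlimit(kind, (limit, limit))


def run_limited(
    argv: List[str],
    cpu_seconds: int = 5,
    memory_bytes: int = 2 << 30,
    timeout: float = 10.0,
) -> subprocess.CompletedProcess:
    """Run argv without a shell, under CPU, address-space, and time limits."""
    def apply_limits() -> None:
        # Runs in the child between fork and exec (POSIX only).
        _cap(resource.RLIMIT_CPU, cpu_seconds)
        _cap(resource.RLIMIT_AS, memory_bytes)

    return subprocess.run(
        argv,
        shell=False,            # argv list, no shell interpolation
        preexec_fn=apply_limits,
        capture_output=True,
        text=True,
        timeout=timeout,        # wall-clock limit; raises TimeoutExpired
    )


done = run_limited([sys.executable, "-c", "print('ok')"])
print(done.returncode, done.stdout.strip())
```

Limits like these reduce the blast radius of runaway code but do not replace isolation: the child still shares the host filesystem and network unless a VM or container restricts them.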

Compile Verifiers

Multi-language compilation verification

Custom Verifiers

Create your own verification logic

Test Verifiers

pytest and unittest verification

Execution Verifier

Multi-Language Verifier

Schema verifiers

JSON-structure / JSON-schema / regex-format verifiers for tool-calling, structured output, and format-discipline finetunes.

Reference-metric verifiers

BLEU / ROUGE / chrF — score candidates against a reference string with the standard MT / summarization metrics.

LLM-as-judge verifier

Rubric-graded verifier for outputs without a programmatic ground truth. Pluggable judge model — local or hosted.