Verifiers

A pluggable verification system for RLVR training: a plugin registry plus programmatic, schema, reference-metric, and LLM-as-judge verifiers.

Verifiers are the heart of RLVR: they provide the reward signal that guides training.

Important: Verifiers are training infrastructure, not benchmarks. For benchmark reporting (comparing to papers), see Evaluation.

Plugin registry

Halo-forge ships a plugin registry for verifiers. Three coexisting registration paths funnel into one dict:

# 1. Decorator (programmatic)
from halo_forge.rlvr.verifiers import register_verifier, Verifier, VerifyResult

@register_verifier("my_check")
class MyVerifier(Verifier):
    def verify(self, code: str) -> VerifyResult:
        ...

# 2. Plugin directory
#    ~/.halo-forge/verifiers/my_check.py
#    Auto-discovered on first registry access. Override path with
#    HALOFORGE_VERIFIERS_DIR.

# 3. Entry-point packages (pip-installable plugins)
#    [project.entry-points."halo_forge.verifiers"]
#    my_check = "my_pkg.verifiers:MyVerifier"

The trainer / CLI / public-API consumers call get_verifier(name) instead of importing concrete classes:

from halo_forge.rlvr.verifiers import get_verifier
verifier_cls = get_verifier("my_check")
verifier = verifier_cls(...)

Use list_registered_verifiers() to see every name in the registry.
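
To illustrate how a decorator can funnel registrations into one dict, here is a minimal, self-contained sketch. It is not halo-forge's implementation: the `_REGISTRY` dict, the duplicate-name error, and the `AlwaysPass` class are assumptions made for the example. Only the three function names mirror the API described above.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the registry dict; halo-forge's internals may differ.
_REGISTRY: Dict[str, type] = {}


def register_verifier(name: str) -> Callable[[type], type]:
    """Class decorator that stores the decorated class under `name`."""
    def decorator(cls: type) -> type:
        if name in _REGISTRY:
            # Assumed policy: refuse silent overwrites.
            raise ValueError(f"verifier {name!r} is already registered")
        _REGISTRY[name] = cls
        return cls
    return decorator


def get_verifier(name: str) -> type:
    """Return the class registered under `name`, or raise KeyError."""
    try:
        return _REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(_REGISTRY)) or "<none>"
        raise KeyError(f"unknown verifier {name!r}; registered: {known}") from None


def list_registered_verifiers() -> List[str]:
    """Return every registered name in sorted order."""
    return sorted(_REGISTRY)


@register_verifier("always_pass")
class AlwaysPass:  # hypothetical toy verifier
    def verify(self, code: str) -> bool:
        return True


verifier = get_verifier("always_pass")()
print(verifier.verify("int main(void) { return 0; }"))
print(list_registered_verifiers())
```

Looking classes up by name keeps callers decoupled from concrete imports, which is why a plugin directory or an entry-point package can add verifiers without touching the trainer.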

Built-in short names

Pass any of these as --verifier <name> to GRPO / RAFT / data synthesize:

| Category | Names |
| --- | --- |
| Code execution + compile | gcc, clang, mingw, execution, gcc_execution, mingw_execution, clang_execution, pytest, unittest, rlvr_pytest, humaneval, mbpp, rust, cargo, go, custom, subprocess |
| Schema + format | json_structure, json_schema, regex_format |
| Reference metrics | bleu, rouge, chrf |
| LLM-as-judge | llm_judge |

Verifiers vs Benchmarks

Verifiers and benchmarks serve different purposes:

| Aspect | Verifiers (Training) | Benchmarks (Reporting) |
| --- | --- | --- |
| Purpose | Provide reward signal for RAFT | Compare model to published results |
| Output | Graduated rewards (0.0 to 1.0) | Metrics (pass@k, accuracy, WER) |
| When Used | During training loop | After training is complete |
| Tooling | Native halo-forge verifiers | Community tools (VLMEvalKit, etc.) |

Use verifiers when you are running RAFT training, need graduated feedback, or are debugging training.

Use benchmarks when you are evaluating a trained model, comparing to papers, or publishing results.


Built-in Verifiers

Compilation Verifiers

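| Verifier | Language | Target | Compile | Run | Cross-Compile |
| --- | --- | --- | --- | --- | --- |
| GCCVerifier | C/C++ | Linux ELF | Yes | Yes | - |
| ClangVerifier | C/C++ | Linux ELF | Yes | Yes | - |
| MinGWVerifier | C/C++ | Windows PE | Yes | No | - |
| RemoteMSVCVerifier | C/C++ | Windows PE | Yes | Yes | Requires Windows server |
| RustVerifier | Rust | Native/Windows | Yes | Yes | x86_64-pc-windows-gnu |
| GoVerifier | Go | Native/Windows | Yes | Yes | GOOS=windows |
| DotNetVerifier | C# | Windows PE | Yes | No | win-x64 |
| PowerShellVerifier | PowerShell | Script | Syntax | No | - |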

Test Verifiers

| Verifier | Language | Use Case |
| --- | --- | --- |
| PytestVerifier | Python | Code with tests |
| UnittestVerifier | Python | unittest format |
| HumanEvalVerifier | Python | HumanEval benchmark |
| MBPPVerifier | Python | MBPP benchmark |
| SubprocessVerifier | Any | Custom commands |

Basic Usage

from halo_forge.rlvr.verifiers import GCCVerifier

verifier = GCCVerifier()
result = verifier.verify(code)

print(result.success)   # True/False
print(result.reward)    # 0.0 - 1.0
print(result.details)   # Human-readable message
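
To make the three result fields concrete, the sketch below builds a toy verifier that only checks whether Python source parses. Everything in it is a stand-in so that it runs without halo-forge installed: the `VerifyResult` dataclass copies only the field names shown above, and `PythonSyntaxVerifier` is a hypothetical class, not one of the built-ins.

```python
import ast
from dataclasses import dataclass


@dataclass
class VerifyResult:
    """Stand-in carrying the three fields used in the usage example above."""
    success: bool
    reward: float
    details: str


class PythonSyntaxVerifier:
    """Hypothetical verifier: rewards code that parses as Python."""

    def verify(self, code: str) -> VerifyResult:
        try:
            ast.parse(code)
        except SyntaxError as exc:
            return VerifyResult(False, 0.0, f"syntax error on line {exc.lineno}: {exc.msg}")
        # 0.5 is an assumed value, chosen to match "correct syntax" in the
        # graduated reward scale described in this document.
        return VerifyResult(True, 0.5, "parses cleanly")


result = PythonSyntaxVerifier().verify("def add(a, b):\n    return a + b\n")
print(result.success, result.reward, result.details)
```

In a real plugin the class would presumably subclass `Verifier` and be registered with `@register_verifier`, as shown in the plugin registry section.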

Graduated Rewards

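Binary rewards give a sparse training signal: every failure looks the same, no matter how close it came. halo-forge uses graduated rewards instead, so partial progress earns partial credit: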

| Outcome | Reward | Signal |
| --- | --- | --- |
| Syntax error | 0.0 | Completely wrong |
| Compiles with warnings | 0.3 | Close but imperfect |
| Compiles clean | 0.5 | Correct syntax |
| Runs without crash | 0.7 | Executable |
| Correct output | 1.0 | Fully correct |

from halo_forge.rlvr.verifiers import RewardLevel

# Get reward from compile result
reward = RewardLevel.from_compile_result(success=True, has_warnings=False)
# Returns 0.5

# Get reward from execution result
reward = RewardLevel.from_execution_result(
    compiles=True, 
    runs=True, 
    correct=False
)
# Returns 0.7
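
The reward table can also be read as a small decision ladder. The function below is an illustrative sketch of that ladder, not the `RewardLevel` implementation. The name `graduated_reward` is hypothetical, and it assumes that warnings stop mattering once the program runs, which the table does not specify.

```python
def graduated_reward(
    compiles: bool,
    has_warnings: bool = False,
    runs: bool = False,
    correct: bool = False,
) -> float:
    """Map verification outcomes to the reward scale in the table above."""
    if not compiles:
        return 0.0  # syntax or compile error
    if runs and correct:
        return 1.0  # correct output
    if runs:
        return 0.7  # runs without crashing, output wrong or unchecked
    # Compiled but did not run (or was not executed at all).
    return 0.3 if has_warnings else 0.5


for outcome in (
    dict(compiles=False),
    dict(compiles=True, has_warnings=True),
    dict(compiles=True),
    dict(compiles=True, runs=True),
    dict(compiles=True, runs=True, correct=True),
):
    print(outcome, "->", graduated_reward(**outcome))
```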

Batch Verification

Verify multiple samples in parallel:

verifier = GCCVerifier(max_workers=8)
codes = [code1, code2, code3, ...]

results = verifier.verify_batch(codes)  # Parallel execution

for result in results:
    print(f"{result.reward}: {result.details}")
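
This page does not say how `verify_batch` parallelizes its work. One plausible approach, sketched below under that caveat, is a thread pool: compile-style verifiers spend most of their time waiting on external processes, so threads are usually enough. The free function `verify_batch` and the `toy_verify` callback are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def verify_batch(
    verify: Callable[[str], float],
    codes: Sequence[str],
    max_workers: int = 8,
) -> List[float]:
    """Run `verify` on every sample concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of `codes`,
        # regardless of which worker finishes first.
        return list(pool.map(verify, codes))


def toy_verify(code: str) -> float:
    """Hypothetical verifier: full reward if the snippet mentions main."""
    return 1.0 if "main" in code else 0.0


rewards = verify_batch(toy_verify, ["int main(void) {}", "int x;", "void main() {}"])
print(rewards)
```

Keeping results in input order matters during training, because each reward has to be matched back to the prompt and completion that produced it.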

With RAFT Training

from halo_forge.rlvr import RAFTTrainer
from halo_forge.rlvr.verifiers import GCCVerifier

verifier = GCCVerifier(max_workers=8)

trainer = RAFTTrainer(
    verifier=verifier,
    sft_checkpoint="models/sft/final_model"
)

trainer.run(prompts, num_cycles=5)
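
RAFT-style (reward-ranked) training generally keeps only the better-scoring samples for the next fine-tuning round. This page does not describe how `RAFTTrainer` selects samples, so the sketch below is an assumption-laden illustration of the idea: the function name `select_for_finetuning` and the `reward_threshold` parameter are hypothetical.

```python
from typing import List, Tuple


def select_for_finetuning(
    samples: List[Tuple[str, float]],
    reward_threshold: float = 0.5,
) -> List[str]:
    """Keep completions whose reward meets the threshold, best first."""
    kept = [(code, reward) for code, reward in samples if reward >= reward_threshold]
    # sorted() is stable, so equal rewards keep their original relative order.
    ranked = sorted(kept, key=lambda pair: pair[1], reverse=True)
    return [code for code, _ in ranked]


scored = [("sample_a", 0.3), ("sample_b", 1.0), ("sample_c", 0.5)]
print(select_for_finetuning(scored))
```

With graduated rewards the threshold becomes a tuning knob: a lower value admits samples that merely compile, while a higher value keeps only those that run or produce correct output.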

Verifier Architecture

                         Verifier (base class)
                                │
        ┌───────────────────────┼───────────────────────┐
        │                       │                       │
  CompileVerifier          TestVerifier           CustomVerifier
        │                       │
   ┌────┼────┬────┬────┐   ┌────┴────┐
   │    │    │    │    │   │         │
  GCC MinGW Clang Rust Go Pytest  Unittest
              │
         RemoteMSVC
         DotNet
         PowerShell

Chaining Verifiers

Run multiple verification stages:

from halo_forge.rlvr.verifiers import ChainedVerifier, GCCVerifier

verifier = ChainedVerifier([
    GCCVerifier(),                        # Stage 1: Compile
    GCCVerifier(run_after_compile=True),  # Stage 2: Run
])

result = verifier.verify(code)
# Stops at first failure, accumulates rewards
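
The stop-at-first-failure behavior can be sketched without halo-forge as follows. How `ChainedVerifier` actually accumulates rewards is not specified here. This sketch assumes the chain reports the highest reward reached before stopping, which is only one possible reading. `VerifyResult`, `verify_chain`, and the two toy stages are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class VerifyResult:
    """Stand-in with the result fields used elsewhere on this page."""
    success: bool
    reward: float
    details: str


Stage = Callable[[str], VerifyResult]


def verify_chain(stages: Sequence[Stage], code: str) -> VerifyResult:
    """Run stages in order and stop at the first one that fails."""
    if not stages:
        raise ValueError("need at least one stage")
    best_reward = 0.0
    for stage in stages:
        result = stage(code)
        # Assumed accumulation rule: keep the highest reward seen so far.
        best_reward = max(best_reward, result.reward)
        if not result.success:
            break  # later stages never run
    return VerifyResult(result.success, best_reward, result.details)


def stage_compile(code: str) -> VerifyResult:
    """Toy 'compile' stage: passes if the source defines main."""
    ok = "main" in code
    return VerifyResult(ok, 0.5 if ok else 0.0, "has main" if ok else "no main")


def stage_run(code: str) -> VerifyResult:
    """Toy 'run' stage: passes if the source returns 0."""
    ok = "return 0" in code
    return VerifyResult(ok, 0.7 if ok else 0.0, "exits 0" if ok else "bad exit")


print(verify_chain([stage_compile, stage_run], "int main(void) { return 0; }"))
print(verify_chain([stage_compile, stage_run], "int main(void) { }"))
```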

Cleanup

Always clean up resources:

verifier = GCCVerifier()

try:
    results = verifier.verify_batch(codes)
finally:
    verifier.cleanup()

# Or use context manager
with GCCVerifier() as verifier:
    results = verifier.verify_batch(codes)
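
The two forms above are equivalent when `__exit__` simply calls `cleanup()`. The sketch below shows that pattern with a hypothetical `ScratchVerifier` that owns a temporary build directory. What resources the real verifiers hold is not stated on this page, so the temporary directory is an assumption.

```python
import os
import shutil
import tempfile


class ScratchVerifier:
    """Hypothetical verifier that owns a temporary working directory."""

    def __init__(self) -> None:
        self.workdir = tempfile.mkdtemp(prefix="verifier_")

    def cleanup(self) -> None:
        # ignore_errors makes cleanup safe to call more than once.
        shutil.rmtree(self.workdir, ignore_errors=True)

    def __enter__(self) -> "ScratchVerifier":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.cleanup()
        return False  # never swallow exceptions raised inside the block


with ScratchVerifier() as verifier:
    path = verifier.workdir
    print(os.path.isdir(path))

print(os.path.exists(path))
```

Because `__exit__` runs even when the block raises, the context manager form gives the same guarantee as `try`/`finally` with less code.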

Verifier Safety

Verifiers execute model-generated code. Treat verifier runs as untrusted execution.

Current Safety Characteristics

| Verifier Type | Safety Notes |
| --- | --- |
| SubprocessVerifier | Uses argv execution with shell=False |
| Python verifiers | Execute without OS-level resource limits |
| C/C++ verifiers | Apply setrlimit for CPU and memory during execution |

  1. Isolate execution: Run verifiers in VMs or containers, avoid host mounts
  2. Non-privileged user: Do not run verifiers as root
  3. Constrain resources: Use OS limits or container quotas where possible
  4. Restrict network: Prefer offline execution or blocked egress

These are operational recommendations, not enforced by halo-forge itself.
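
As one way to act on the "constrain resources" recommendation, the sketch below runs a Python snippet in a child process with CPU and address-space limits applied through the standard `resource` module. It is an illustration only and not something halo-forge does for you. It is Unix-only, the function names and default limits are assumptions, and it does not replace container or VM isolation.

```python
import resource
import subprocess
import sys


def _apply_limits(cpu_seconds: int, memory_bytes: int) -> None:
    """Lower the soft CPU and address-space limits (runs in the child)."""
    wanted = ((resource.RLIMIT_CPU, cpu_seconds), (resource.RLIMIT_AS, memory_bytes))
    for kind, value in wanted:
        _, hard = resource.getrlimit(kind)
        # Never ask for more than the existing hard limit allows.
        soft = value if hard == resource.RLIM_INFINITY else min(value, hard)
        resource.setrlimit(kind, (soft, hard))


def run_untrusted(
    code: str,
    cpu_seconds: int = 5,
    memory_bytes: int = 2 * 1024**3,
    timeout: float = 10.0,
) -> subprocess.CompletedProcess:
    """Run `code` in a separate interpreter with resource limits applied."""
    return subprocess.run(
        [sys.executable, "-I", "-c", code],  # argv list, so no shell is involved
        capture_output=True,
        text=True,
        timeout=timeout,  # wall-clock guard; raises subprocess.TimeoutExpired
        preexec_fn=lambda: _apply_limits(cpu_seconds, memory_bytes),
    )


done = run_untrusted("print(1 + 1)")
print(done.returncode, done.stdout.strip())
```

The CPU limit stops busy loops, the wall-clock timeout covers code that sleeps or blocks, and the address-space limit bounds memory use. None of these restrict file or network access, which is why the isolation and network recommendations above still apply.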