Reference-metric verifiers

BLEU / ROUGE / chrF — score candidates against a reference string with the standard MT / summarization metrics.

Wraps the standard MT / summarization scoring metrics behind the halo-forge Verifier interface. The trainer scores candidates against a reference string without writing the bridge per-task.

All three normalize their score into reward ∈ [0.0, 1.0] so RAFT / GRPO can mix them with execution-based rewards in a chained verifier without rescaling math.

bleu

SacreBLEU — the standard MT score, comparable across papers because tokenization is fixed.

from halo_forge.rlvr.verifiers import get_verifier

v = get_verifier("bleu")(
    references="The quick brown fox jumps over the lazy dog.",
    success_threshold=0.3,  # BLEU/100 above which success=True
)
v.verify("The quick brown fox jumps over the dog.")

Multi-reference is supported (translation tasks where several valid translations exist):

v = get_verifier("bleu")(
    references=[
        "The quick brown fox.",
        "A quick brown fox.",
    ],
)

Requires pip install sacrebleu.

rouge

google-research’s rouge_score — the standard summarization metric. ROUGE-L F-measure by default; ROUGE-1 and ROUGE-2 also available.

v = get_verifier("rouge")(
    reference="A summary that captures the main points concisely.",
    rouge_type="rougeL",  # or "rouge1", "rouge2"
)
v.verify("Summary capturing main points concisely.")

Requires pip install rouge_score.

chrf

Character n-gram F-score via SacreBLEU. More robust than BLEU on morphologically rich languages and increasingly the recommended MT baseline.

v = get_verifier("chrf")(
    references="The quick brown fox jumps over the lazy dog.",
    word_order=2,  # chrF++ (default); 0 = plain chrF
)

Requires pip install sacrebleu (same package as BLEU).

When to pick which

Task	Recommended	Why
Translation	`chrf`	Modern MT baseline; robust to morphology
Summarization	`rouge`	Standard summarization metric
Paraphrase / generation	`bleu`	Most papers report BLEU; keep comparable
Strictly formatted output	`regex_format` (V3)	Reference-metric won’t catch structure
Tool calls / JSON	`json_schema` (V3)	Same — structure first

Composing

Metrics compose with execution-based verifiers via ChainedVerifier — useful for “code that compiles AND matches a reference solution”:

from halo_forge.rlvr.verifiers import (
    ChainedVerifier,
    GCCVerifier,
    BLEUVerifier,
)

v = ChainedVerifier([
    GCCVerifier(),                          # 0.5 if it compiles
    BLEUVerifier(references=ref_solution),  # 0.5 if it's close to the reference
])

Lazy imports

All three metric packages (sacrebleu, rouge_score) are lazy-imported. The verifier modules load on installs without those deps; the verifier’s first .verify() call surfaces a clean ImportError pointing at the right pip install command.

Empty candidates short-circuit before lazy import so they fail with error="empty_response" regardless of dep availability.