Release Equivalents (External Benchmarks)
This note maps our current release protocol to practices used by established public benchmarks.
Observed External Patterns
- `lm-evaluation-harness` exposes bootstrap-based stderr reporting (`bootstrap_iters`) in its evaluator code, so public claims can report uncertainty, not only point estimates; a minimal bootstrap sketch follows this list.
- SWE-bench moved to a containerized harness for reproducible evaluations and documents Docker as the standard evaluation environment.
- SWE-bench keeps part of its test evaluation private (for leaderboard submissions) and maintains a curated Verified subset for reliable comparisons.
- SWE-bench-Live ships monthly updates while preserving frozen `lite`/`verified` splits for stable leaderboard comparisons.
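To make the first pattern concrete, here is a minimal sketch of bootstrap stderr over per-instance pass/fail outcomes, in the spirit of the harness's `bootstrap_iters` knob. This is illustrative, not the harness's actual code; the `bootstrap_stderr` helper and the example data are ours.

```python
import random
import statistics

def bootstrap_stderr(outcomes: list[int], iters: int = 1000, seed: int = 0) -> float:
    """Bootstrap standard error of a pass rate over per-instance 0/1 outcomes.

    Same idea as the harness's ``bootstrap_iters`` knob: resample instances
    with replacement, recompute the metric, and report the spread of the
    resampled means alongside the point estimate.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    means = [statistics.fmean(rng.choices(outcomes, k=n)) for _ in range(iters)]
    return statistics.stdev(means)

# Example: 63 passes out of 100 instances -> 0.63 plus an uncertainty band.
outcomes = [1] * 63 + [0] * 37
print(f"pass rate = {statistics.fmean(outcomes):.2f} ± {bootstrap_stderr(outcomes):.3f}")
```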
Current Mapping In This Addendum
- Deterministic split + contamination controls: implemented (a hash-based split sketch follows this list).
- Public-vs-full release visibility and held-out commitments: implemented.
- Run bundle checksums and explicit release checklist/failure policy: implemented (checksum manifest sketched below).
- Pass@k protocol and multi-model suite runner: implemented (estimator sketched below).
- Infra-vs-model failure accounting and retry policy: implemented (retry sketch below).
- Bootstrap confidence intervals on primary verification metrics: implemented (see the bootstrap sketch above).
- Per-repo replay profiles (policy overrides by repo/Lean version with strict matching mode): implemented (profile sketch below).
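On the deterministic split: keying the split to stable instance IDs rather than RNG state makes it reproducible by construction. A minimal sketch, assuming each instance carries a stable string ID; the helper name and the 80/20 fraction are illustrative, not our exact implementation.

```python
import hashlib

def split_bucket(instance_id: str, public_fraction: float = 0.8) -> str:
    """Assign an instance to 'public' or 'heldout' deterministically.

    Hashing the stable ID (instead of drawing from an RNG) means the split
    survives re-runs, reorderings, and later dataset additions.
    """
    digest = hashlib.sha256(instance_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a uniform value in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    return "public" if u < public_fraction else "heldout"

# The same ID always lands in the same bucket.
assert split_bucket("repo-a/issue-1234") == split_bucket("repo-a/issue-1234")
```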
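For run bundle checksums, a release check can hash every file in the bundle into a manifest and recompute it at verification time. A sketch under an assumed file layout; `bundle_manifest` and the paths are hypothetical, not our actual tooling.

```python
import hashlib
import json
from pathlib import Path

def bundle_manifest(bundle_dir: str) -> dict[str, str]:
    """SHA-256 every file in a run bundle so a release can be verified byte-for-byte."""
    manifest = {}
    # Sort paths so the manifest is deterministic across filesystems.
    for path in sorted(Path(bundle_dir).rglob("*")):
        if path.is_file():
            rel = str(path.relative_to(bundle_dir))
            manifest[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest

# Hypothetical usage: write the manifest next to the bundle; the release
# checklist recomputes it and fails on any mismatch.
# manifest = bundle_manifest("artifacts/run_bundle")
# Path("artifacts/run_bundle.manifest.json").write_text(json.dumps(manifest, indent=2))
```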
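For pass@k, the standard unbiased estimator (Chen et al., 2021) computes, for n samples with c successes, one minus the probability that a random size-k subset of the samples contains no success. A sketch of that standard estimator; whether our suite runner uses exactly this form is not shown here.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k given n samples with c successes (Chen et al., 2021).

    pass@k = 1 - C(n - c, k) / C(n, k): one minus the probability that a
    uniformly random size-k subset of the n samples contains no success.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=3, k=5))  # ≈ 0.601 for 3 successes in 20 samples
```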
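For infra-vs-model failure accounting, the key property is that only infrastructure failures are retried and excluded from model accuracy, while model failures count against the score. A minimal sketch, assuming each run surfaces a log tail; the marker list, helper names, and retry budget are all illustrative.

```python
# Hypothetical failure taxonomy: only infra failures are retried and are
# excluded from model accuracy; model failures count against the model.
INFRA_MARKERS = ("timeout", "oom", "network", "container")  # illustrative

def classify_failure(log_tail: str) -> str:
    """Tag a failed run as 'infra' (retryable) or 'model' (counts against score)."""
    lowered = log_tail.lower()
    return "infra" if any(marker in lowered for marker in INFRA_MARKERS) else "model"

def run_with_retries(run_once, max_infra_retries: int = 2):
    """Retry a run only when the failure is classified as infrastructure."""
    for _ in range(max_infra_retries + 1):
        ok, log_tail = run_once()  # run_once() -> (succeeded, tail of run log)
        if ok or classify_failure(log_tail) == "model":
            return ok, log_tail
    return ok, log_tail  # infra retries exhausted; report the last attempt
```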
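For per-repo replay profiles, the shape is a per-repo override table plus a strict mode that refuses to replay under a drifted Lean version. A hypothetical sketch; the field names, versions, and defaults are assumptions, not our actual schema.

```python
# Hypothetical shape of a per-repo replay profile; field names are illustrative.
REPLAY_PROFILES = {
    "repo-a": {"lean_version": "4.9.0", "strict": True, "timeout_s": 600},
    "repo-b": {"lean_version": "4.7.0", "strict": False, "timeout_s": 300},
}

DEFAULT_PROFILE = {"strict": False, "timeout_s": 300}

def resolve_profile(repo: str, detected_lean: str) -> dict:
    """Look up a repo's replay overrides; strict mode rejects version drift."""
    profile = REPLAY_PROFILES.get(repo, {**DEFAULT_PROFILE, "lean_version": detected_lean})
    if profile["strict"] and profile["lean_version"] != detected_lean:
        raise RuntimeError(
            f"{repo}: pinned Lean {profile['lean_version']} != detected {detected_lean}"
        )
    return profile
```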
Remaining Gap To Match Top Public Leaderboards
- Repository-context replay is first-pass only (not yet full-project CI replay at leaderboard scale).
- There is not yet a fully pinned container image covering the Lean toolchain and the per-repo replay runtime matrix.
- Large-scale distributed execution and a governance workflow for public submissions are not yet implemented.
Recommendation
- Keep current release mode as `beta` for strict executable filtering and protocol validation.
- Promote to public leaderboard mode only after adding:
  - pinned replay runtime image (including Lean/toolchain expectations; a hypothetical pin check is sketched after this list),
- larger heldout suite runs with stable model/runtime pinning,
- distributed replay execution + submission governance policy.
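As a concrete starting point for the first item, a pinned-runtime check could compare each repo's `lean-toolchain` file against a committed pin manifest before any replay runs. A hypothetical sketch; the manifest path, layout, and helper name are assumptions.

```python
import json
from pathlib import Path

def check_toolchain_pins(pins_file: str, repos_root: str) -> list[str]:
    """Compare each checked-out repo's lean-toolchain against a pin manifest.

    Returns human-readable mismatches; an empty list means the runtime
    matches what the release claims to have run under.
    """
    # Assumed manifest layout: {"repo-a": "leanprover/lean4:v4.9.0", ...}
    pins = json.loads(Path(pins_file).read_text())
    mismatches = []
    for repo, pinned in pins.items():
        actual = (Path(repos_root) / repo / "lean-toolchain").read_text().strip()
        if actual != pinned:
            mismatches.append(f"{repo}: pinned {pinned!r} but found {actual!r}")
    return mismatches
```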