Release Equivalents (External Benchmarks)
This note maps our current release protocol to practices used by established public benchmarks.
Observed External Patterns
- lm-evaluation-harness exposes bootstrap-based stderr reporting (`bootstrap_iters`) in its evaluator code, so public claims can report uncertainty, not only point estimates.
- Source: https://raw.githubusercontent.com/EleutherAI/lm-evaluation-harness/main/lm_eval/evaluator.py
- SWE-bench moved to a containerized harness for reproducible evaluations and documents Docker as the standard evaluation environment.
- Source: https://raw.githubusercontent.com/SWE-bench/SWE-bench/main/README.md
- SWE-bench keeps some test evaluation private (for leaderboard submission) and uses a curated verified subset for reliable comparisons.
- Source: https://raw.githubusercontent.com/SWE-bench/SWE-bench/main/README.md
- Source: https://www.swebench.com/verified.html
- SWE-bench-Live states monthly updates while preserving frozen `lite`/`verified` splits for stable leaderboard comparisons.
- Source: https://swe-bench-live.github.io/
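The bootstrap stderr reporting noted above can be sketched in a few lines. This is an illustrative resampling estimate under an assumed list of per-example pass/fail scores, not the harness's actual `bootstrap_iters` implementation:

```python
import random

def bootstrap_stderr(scores, iters=1000, seed=0):
    """Estimate the standard error of the mean score by bootstrap resampling:
    resample the score list with replacement `iters` times and take the
    standard deviation of the resampled means."""
    rng = random.Random(seed)  # fixed seed for reproducible reporting
    n = len(scores)
    means = []
    for _ in range(iters):
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    grand = sum(means) / iters
    var = sum((m - grand) ** 2 for m in means) / (iters - 1)
    return var ** 0.5

# Hypothetical per-example pass/fail outcomes for one benchmark task
scores = [1, 0, 1, 1, 0, 1, 0, 1]
stderr = bootstrap_stderr(scores)
```

Reporting `mean ± stderr` rather than a bare point estimate is what lets external leaderboards distinguish real gaps from sampling noise.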
Current Mapping In This Addendum
- Deterministic split + contamination controls: implemented.
- Public-vs-full release visibility and heldout commitments: implemented.
- Run bundle checksums and explicit release checklist/failure policy: implemented.
- Pass@k protocol and multi-model suite runner: implemented.
- Infra-vs-model failure accounting and retry policy: implemented.
- Bootstrap confidence intervals on primary verification metrics: implemented.
- Per-repo replay profiles (policy overrides by repo/Lean version with strict matching mode): implemented.
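For the pass@k protocol above, a minimal sketch of the standard unbiased estimator (Chen et al., 2021), given `n` total samples of which `c` pass verification; the function name and usage here are illustrative, not our suite runner's actual API:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n attempts is correct, given c correct attempts.
    Computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 samples and 3 passes, `pass_at_k(10, 3, 1)` is 0.3, and it rises toward 1.0 as `k` grows.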
Remaining Gap To Match Top Public Leaderboards
- Repository-context replay is first-pass (not yet full-project CI replay at leaderboard scale).
- No fully pinned container image yet that includes the Lean toolchain and the per-repo replay runtime matrix.
- Large-scale distributed execution and a governance workflow for public submissions are not yet implemented.
Recommendation
- Keep the current release mode as `beta` for strict executable filtering and protocol validation.
- Promote to public leaderboard mode only after adding:
- pinned replay runtime image (including Lean/toolchain expectations),
- larger heldout suite runs with stable model/runtime pinning,
- distributed replay execution + submission governance policy.
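A promotion gate of this kind can be sketched as a checksum-manifest check over a run bundle, matching the "run bundle checksums and explicit release checklist" item in the mapping above. `bundle_manifest`, `verify_bundle`, and the directory layout are hypothetical, not the real release tooling:

```python
import hashlib
import pathlib

def bundle_manifest(bundle_dir):
    """Compute a sha256 manifest (relative path -> hex digest) for every
    file under a run bundle directory, in sorted order for determinism."""
    root = pathlib.Path(bundle_dir)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = str(path.relative_to(root))
            manifest[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest

def verify_bundle(bundle_dir, expected_manifest):
    """Release-checklist gate: promotion fails if any file was added,
    removed, or modified since the manifest was recorded."""
    return bundle_manifest(bundle_dir) == expected_manifest
```

The same gate extends naturally to pinned runtime images: record the image digest alongside the manifest, and refuse promotion on any drift.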