Release Equivalents (External Benchmarks)
This note maps our current release protocol to practices used by established public benchmarks.
Observed External Patterns
- `lm-evaluation-harness` exposes bootstrap-based stderr reporting (`bootstrap_iters`) in its evaluator code, so public claims can report uncertainty, not only point estimates; a minimal bootstrap sketch follows this list.
- SWE-bench moved to a containerized harness for reproducible evaluations and documents Docker as the standard evaluation environment.
- SWE-bench keeps part of its test evaluation private (for leaderboard submissions) and maintains a curated Verified subset for reliable comparisons.
- SWE-bench-Live ships monthly updates while preserving frozen `lite`/`verified` splits for stable leaderboard comparisons.
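To make the first pattern concrete, here is a minimal sketch of bootstrap stderr over per-instance pass/fail outcomes, in the spirit of the harness's `bootstrap_iters` knob. This is illustrative, not the harness's actual code; the `bootstrap_stderr` helper and the example data are ours.

```python
import random
import statistics

def bootstrap_stderr(outcomes: list[int], iters: int = 1000, seed: int = 0) -> float:
    """Bootstrap standard error of a pass rate over per-instance 0/1 outcomes.

    Same idea as the harness's ``bootstrap_iters`` knob: resample instances
    with replacement, recompute the metric, and report the spread of the
    resampled means alongside the point estimate.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    means = [statistics.fmean(rng.choices(outcomes, k=n)) for _ in range(iters)]
    return statistics.stdev(means)

# Example: 63 passes out of 100 instances -> 0.63 plus an uncertainty band.
outcomes = [1] * 63 + [0] * 37
print(f"pass rate = {statistics.fmean(outcomes):.2f} ± {bootstrap_stderr(outcomes):.3f}")
```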
Current Mapping In This Addendum
- Deterministic split + contamination controls: implemented (a hash-based split sketch follows this list).
- Public-vs-full release visibility and held-out commitments: implemented.
- Run bundle checksums and explicit release checklist/failure policy: implemented (checksum manifest sketched below).
- Pass@k protocol and multi-model suite runner: implemented (estimator sketched below).
- Infra-vs-model failure accounting and retry policy: implemented (retry sketch below).
- Bootstrap confidence intervals on primary verification metrics: implemented (see the bootstrap sketch above).
- Per-repo replay profiles (policy overrides by repo/Lean version with strict matching mode): implemented (profile sketch below).
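On the deterministic split: keying the split to stable instance IDs rather than RNG state makes it reproducible by construction. A minimal sketch, assuming each instance carries a stable string ID; the helper name and the 80/20 fraction are illustrative, not our exact implementation.

```python
import hashlib

def split_bucket(instance_id: str, public_fraction: float = 0.8) -> str:
    """Assign an instance to 'public' or 'heldout' deterministically.

    Hashing the stable ID (instead of drawing from an RNG) means the split
    survives re-runs, reorderings, and later dataset additions.
    """
    digest = hashlib.sha256(instance_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a uniform value in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    return "public" if u < public_fraction else "heldout"

# The same ID always lands in the same bucket.
assert split_bucket("repo-a/issue-1234") == split_bucket("repo-a/issue-1234")
```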
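For run bundle checksums, a release check can hash every file in the bundle into a manifest and recompute it at verification time. A sketch under an assumed file layout; `bundle_manifest` and the paths are hypothetical, not our actual tooling.

```python
import hashlib
import json
from pathlib import Path

def bundle_manifest(bundle_dir: str) -> dict[str, str]:
    """SHA-256 every file in a run bundle so a release can be verified byte-for-byte."""
    manifest = {}
    # Sort paths so the manifest is deterministic across filesystems.
    for path in sorted(Path(bundle_dir).rglob("*")):
        if path.is_file():
            rel = str(path.relative_to(bundle_dir))
            manifest[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest

# Hypothetical usage: write the manifest next to the bundle; the release
# checklist recomputes it and fails on any mismatch.
# manifest = bundle_manifest("artifacts/run_bundle")
# Path("artifacts/run_bundle.manifest.json").write_text(json.dumps(manifest, indent=2))
```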
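For pass@k, the standard unbiased estimator (Chen et al., 2021) computes, for n samples with c successes, one minus the probability that a random size-k subset of the samples contains no success. A sketch of that standard estimator; whether our suite runner uses exactly this form is not shown here.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k given n samples with c successes (Chen et al., 2021).

    pass@k = 1 - C(n - c, k) / C(n, k): one minus the probability that a
    uniformly random size-k subset of the n samples contains no success.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=3, k=5))  # ≈ 0.601 for 3 successes in 20 samples
```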
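For infra-vs-model failure accounting, the key property is that only infrastructure failures are retried and excluded from model accuracy, while model failures count against the score. A minimal sketch, assuming each run surfaces a log tail; the marker list, helper names, and retry budget are all illustrative.

```python
# Hypothetical failure taxonomy: only infra failures are retried and are
# excluded from model accuracy; model failures count against the model.
INFRA_MARKERS = ("timeout", "oom", "network", "container")  # illustrative

def classify_failure(log_tail: str) -> str:
    """Tag a failed run as 'infra' (retryable) or 'model' (counts against score)."""
    lowered = log_tail.lower()
    return "infra" if any(marker in lowered for marker in INFRA_MARKERS) else "model"

def run_with_retries(run_once, max_infra_retries: int = 2):
    """Retry a run only when the failure is classified as infrastructure."""
    for _ in range(max_infra_retries + 1):
        ok, log_tail = run_once()  # run_once() -> (succeeded, tail of run log)
        if ok or classify_failure(log_tail) == "model":
            return ok, log_tail
    return ok, log_tail  # infra retries exhausted; report the last attempt
```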
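For per-repo replay profiles, the shape is a per-repo override table plus a strict mode that refuses to replay under a drifted Lean version. A hypothetical sketch; the field names, versions, and defaults are assumptions, not our actual schema.

```python
# Hypothetical shape of a per-repo replay profile; field names are illustrative.
REPLAY_PROFILES = {
    "repo-a": {"lean_version": "4.9.0", "strict": True, "timeout_s": 600},
    "repo-b": {"lean_version": "4.7.0", "strict": False, "timeout_s": 300},
}

DEFAULT_PROFILE = {"strict": False, "timeout_s": 300}

def resolve_profile(repo: str, detected_lean: str) -> dict:
    """Look up a repo's replay overrides; strict mode rejects version drift."""
    profile = REPLAY_PROFILES.get(repo, {**DEFAULT_PROFILE, "lean_version": detected_lean})
    if profile["strict"] and profile["lean_version"] != detected_lean:
        raise RuntimeError(
            f"{repo}: pinned Lean {profile['lean_version']} != detected {detected_lean}"
        )
    return profile
```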
Remaining Gap To Match Top Public Leaderboards
- Repository-context replay is first-pass only (not yet full-project CI replay at leaderboard scale).
- There is not yet a fully pinned container image covering the Lean toolchain and the per-repo replay runtime matrix.
- Large-scale distributed execution and a governance workflow for public submissions are not yet implemented.
Recommendation
- Keep current release mode as `beta` for strict executable filtering and protocol validation.
- Promote to public leaderboard mode only after adding:
  - pinned replay runtime image (including Lean/toolchain expectations; a hypothetical pin check is sketched after this list),
- larger heldout suite runs with stable model/runtime pinning,
- distributed replay execution + submission governance policy.
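As a concrete starting point for the first item, a pinned-runtime check could compare each repo's `lean-toolchain` file against a committed pin manifest before any replay runs. A hypothetical sketch; the manifest path, layout, and helper name are assumptions.

```python
import json
from pathlib import Path

def check_toolchain_pins(pins_file: str, repos_root: str) -> list[str]:
    """Compare each checked-out repo's lean-toolchain against a pin manifest.

    Returns human-readable mismatches; an empty list means the runtime
    matches what the release claims to have run under.
    """
    # Assumed manifest layout: {"repo-a": "leanprover/lean4:v4.9.0", ...}
    pins = json.loads(Path(pins_file).read_text())
    mismatches = []
    for repo, pinned in pins.items():
        actual = (Path(repos_root) / repo / "lean-toolchain").read_text().strip()
        if actual != pinned:
            mismatches.append(f"{repo}: pinned {pinned!r} but found {actual!r}")
    return mismatches
```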