Release Equivalents (External Benchmarks)

This note maps our current release protocol to practices used by established public benchmarks.

Observed External Patterns

Current Mapping In This Addendum

  • Deterministic split + contamination controls: implemented.
  • Public-vs-full release visibility and heldout commitments: implemented.
  • Run bundle checksums and explicit release checklist/failure policy: implemented.
  • protocol and multi-model suite runner: implemented.
  • Infra-vs-model failure accounting and retry policy: implemented.
  • Bootstrap confidence intervals on primary verification metrics: implemented.
  • Per-repo replay profiles (policy overrides by repo/Lean version with strict matching mode): implemented.

Remaining Gap To Match Top Public Leaderboards

  • Repository-context replay is first-pass (not yet full-project CI replay at leaderboard scale).
  • No fully pinned container image including Lean toolchain and per-repo replay runtime matrix yet.
  • Large-scale distributed execution and governance workflow for public submissions is not yet implemented.

Recommendation

  • Keep current release mode as beta for strict executable filtering and protocol validation.
  • Promote to public leaderboard mode only after adding:
    • pinned replay runtime image (including Lean/toolchain expectations),
    • larger heldout suite runs with stable model/runtime pinning,
    • distributed replay execution + submission governance policy.