Submission Policy (Beta)

This benchmark is open and self-serve: anyone may run it and publish results, provided the protocol below is followed.

Required Submission Artifacts

  • suite_results.json
  • suite_results.jsonl
  • suite_summary.md
  • suite_config.json (or equivalent config used for the run)
  • split_manifest.json
  • contamination_report.json
  • heldout_test_commitments.json
  • SHA256SUMS
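A SHA256SUMS manifest is conventionally a list of `<hex digest>  <filename>` lines, as produced by `sha256sum`. A minimal verification sketch, assuming that format (the helper name is ours, not part of the protocol):

```python
import hashlib

def verify_sha256sums(sums_path):
    """Check every 'HASH  FILENAME' line in a sha256sum-style manifest.

    Returns the list of files whose digest does not match (empty == all
    submitted artifacts verify).
    """
    failures = []
    with open(sums_path) as f:
        for line in f:
            if not line.strip():
                continue
            expected, _, name = line.strip().partition("  ")
            with open(name, "rb") as artifact:
                digest = hashlib.sha256(artifact.read()).hexdigest()
            if digest != expected:
                failures.append(name)
    return failures
```

Generate the manifest after the run completes and before any artifact is touched again; the Disallowed section below forbids edits after checksum generation.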

Minimum Validity Conditions

  • suite_results.json must report status == "success" for every run.
  • contamination_report.json must satisfy fractions.leak_fraction <= config.max_leak_fraction.
  • SHA256SUMS must verify for all submitted files.
  • Any profile-based replay submission must include profile config and strict mode evidence:
    • verification.repo_replay.profile_config_path present
    • verification.repo_replay.profile_strict == true
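The conditions above can be collected into one validity gate. A sketch, assuming the JSON shapes noted in the docstring (the field names come from this policy; the exact nesting is an assumption):

```python
def check_validity(results, contamination, config, verification=None):
    """Return a list of validity problems (empty == submission is valid).

    Assumed shapes (not specified by the policy):
      results:       {"runs": [{"status": "success"}, ...]}
      contamination: {"fractions": {"leak_fraction": 0.0}}
      config:        {"max_leak_fraction": 0.0}
      verification:  {"repo_replay": {...}}, passed only for
                     profile-based replay submissions
    """
    problems = []
    if any(r.get("status") != "success" for r in results["runs"]):
        problems.append("non-success run in suite_results.json")
    if contamination["fractions"]["leak_fraction"] > config["max_leak_fraction"]:
        problems.append("leak_fraction exceeds max_leak_fraction")
    if verification is not None:
        replay = verification.get("repo_replay", {})
        if not replay.get("profile_config_path"):
            problems.append("profile_config_path missing")
        if replay.get("profile_strict") is not True:
            problems.append("profile_strict is not true")
    return problems
```

SHA256SUMS verification is file-level rather than JSON-level, so it is kept as a separate step.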

Claim Levels

  • Harness claim:
    • the run is valid and reproducible as submitted.
  • Clean baseline claim:
    • all harness-claim conditions hold, and model_error_run_count == 0.
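Because the clean baseline claim strictly extends the harness claim, classification is a two-step check. A hypothetical helper (the return labels are ours, not part of the policy):

```python
def claim_level(validity_problems, model_error_run_count):
    """Map a checked submission to its strongest supportable claim.

    validity_problems: list of problems from the validity gate
                       (empty means the run is valid as submitted)
    """
    if validity_problems:
        return None  # not a valid submission; no claim can be made
    if model_error_run_count == 0:
        return "clean_baseline"
    return "harness"
```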

Disclosure Requirements

  • Model identifier(s) and endpoint provider.
  • Inference parameters used in run config.
  • Verification mode (none, synthetic, or repo_replay).
  • If repo_replay, include runtime manifest and profile config.
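The disclosures could live in a small JSON record alongside the run artifacts. An illustrative sketch; every field name and value here is an example, not a prescribed schema:

```json
{
  "model": "example-model-v1",
  "endpoint_provider": "example-provider",
  "inference_parameters": {"temperature": 0.0, "max_tokens": 4096},
  "verification_mode": "repo_replay",
  "runtime_manifest": "runtime_manifest.json",
  "profile_config": "profiles/strict.json"
}
```

If verification_mode is "none" or "synthetic", the replay-specific fields can simply be omitted.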

Disallowed

  • Mixing results from different split manifests under one leaderboard row.
  • Editing output artifacts after checksum generation.
  • Reporting replay results without profile config when strict mode is expected.

Maintainer Notes

  • This policy is intentionally lightweight during beta.
  • Public leaderboard governance can add stronger controls later (for example hosted submission verification or sealed heldout replay infrastructure).