# Submission Policy (Beta)
This benchmark is open and self-serve. Anyone can run and publish results if they follow the protocol below.
## Required Submission Artifacts
- `suite_results.json`
- `suite_results.jsonl`
- `suite_summary.md`
- `suite_config.json` (or equivalent config used for the run)
- `split_manifest.json`
- `contamination_report.json`
- `heldout_test_commitments.json`
- `SHA256SUMS`
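The `SHA256SUMS` file can be produced with a few lines of Python. This is a minimal sketch: the filename list mirrors the artifact list above, and the output uses the conventional `<hex digest>  <filename>` format that `sha256sum -c` accepts.

```python
import hashlib
from pathlib import Path

# Artifacts listed in the submission policy; SHA256SUMS itself is excluded.
ARTIFACTS = [
    "suite_results.json",
    "suite_results.jsonl",
    "suite_summary.md",
    "suite_config.json",
    "split_manifest.json",
    "contamination_report.json",
    "heldout_test_commitments.json",
]

def write_sha256sums(run_dir: str) -> None:
    """Write SHA256SUMS in '<hex digest>  <filename>' format (two spaces)."""
    lines = []
    for name in ARTIFACTS:
        digest = hashlib.sha256(Path(run_dir, name).read_bytes()).hexdigest()
        lines.append(f"{digest}  {name}")
    Path(run_dir, "SHA256SUMS").write_text("\n".join(lines) + "\n")
```

Generate the checksum file only after all output artifacts are final, since editing any artifact afterward invalidates the submission.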
## Minimum Validity Conditions
- `suite_results.json` must report all runs with `status == "success"`.
- `contamination_report.json` must satisfy: `fractions.leak_fraction <= config.max_leak_fraction`.
- `SHA256SUMS` must verify for all submitted files.
- Any profile-based replay submission must include profile config and strict-mode evidence:
  - `verification.repo_replay.profile_config_path` present
  - `verification.repo_replay.profile_strict == true`
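A minimal validity check over the parsed artifacts might look like the sketch below. The field paths follow the conditions above; the exact dict shapes (a top-level `runs` list, and where the `verification` block lives) are assumptions about the JSON layout, and checksum verification is assumed to happen separately.

```python
def is_valid_submission(suite_results: dict, contamination: dict, config: dict) -> bool:
    """Check the minimum validity conditions from the submission policy.

    Assumes suite_results.json has a top-level "runs" list and an optional
    "verification" block; SHA256SUMS verification is handled elsewhere.
    """
    # All runs must have succeeded.
    runs_ok = all(run["status"] == "success" for run in suite_results["runs"])
    # Leak fraction must not exceed the configured ceiling.
    leak_ok = (contamination["fractions"]["leak_fraction"]
               <= config["max_leak_fraction"])
    # Profile-based replay needs a profile config path and strict mode.
    replay = suite_results.get("verification", {}).get("repo_replay")
    replay_ok = replay is None or (
        bool(replay.get("profile_config_path"))
        and replay.get("profile_strict") is True
    )
    return runs_ok and leak_ok and replay_ok
```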
## Claim Levels
- Harness claim:
  - run is valid and reproducible as submitted.
- Clean baseline claim:
  - all harness claim conditions, and
  - `model_error_run_count == 0`.
## Disclosure Requirements
- Model identifier(s) and endpoint provider.
- Inference parameters used in run config.
- Verification mode (`none`, `synthetic`, or `repo_replay`).
- If `repo_replay`, include the runtime manifest and profile config.
## Disallowed
- Mixing results from different split manifests under one leaderboard row.
- Editing output artifacts after checksum generation.
- Reporting replay results without profile config when strict mode is expected.
## Maintainer Notes
- This policy is intentionally lightweight during beta.
- Public leaderboard governance can add stronger controls later (for example hosted submission verification or sealed heldout replay infrastructure).