Data Card (Draft): Lean Sorry Benchmark Index
Dataset Surface
- Primary input: index.jsonl
- Expected companion: index.jsonl.manifest.json (upstream provenance metadata)
- This addendum consumes index artifacts only; it does not crawl repositories directly.
Required Row Fields
Each row must provide:
- item_id
- repo_remote
- repo_commit
- repo_lean_version (nullable)
- location_path
- location_start_line, location_start_column
- location_end_line, location_end_column
- goal_sha256 (nullable)
- goal_text
- source_url
Validation fails fast on rows with missing or invalid fields. A missing goal_text is a hard error.
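The fail-fast behavior described above can be sketched roughly as follows. This is a minimal illustration, not the project's loader: `validate_row` and `load_index` are hypothetical names, and the sketch assumes that nullable fields must still be present as keys and that an empty or whitespace-only goal_text counts as missing.

```python
import json

# Field names come from the list above; the nullable set mirrors the "(nullable)" notes.
REQUIRED_FIELDS = (
    "item_id", "repo_remote", "repo_commit", "repo_lean_version",
    "location_path", "location_start_line", "location_start_column",
    "location_end_line", "location_end_column",
    "goal_sha256", "goal_text", "source_url",
)
NULLABLE_FIELDS = {"repo_lean_version", "goal_sha256"}


def validate_row(row: dict) -> dict:
    """Raise ValueError on the first missing or wrongly null field (hypothetical helper)."""
    for field in REQUIRED_FIELDS:
        if field not in row:
            raise ValueError(f"missing field: {field}")
        if row[field] is None and field not in NULLABLE_FIELDS:
            raise ValueError(f"null value in non-nullable field: {field}")
    # Assumption: an empty or whitespace-only goal_text counts as missing.
    if not isinstance(row["goal_text"], str) or not row["goal_text"].strip():
        raise ValueError("goal_text is missing or empty")
    return row


def load_index(lines):
    """Parse JSONL lines and validate each row, stopping at the first bad one."""
    return [validate_row(json.loads(line)) for line in lines if line.strip()]
```

Raising on the first bad row, rather than collecting errors, matches the fail-fast wording: a malformed index never reaches split generation.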
Derived Fields And Normalization
- goal_bucket is derived in-code:
  - core_easy when the goal text passes strict heuristic checks.
  - full otherwise.
- Rows are sorted by item_id before selection and hashing.
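Sorting by item_id before hashing makes the digest independent of the row order in the input file. A minimal sketch of that idea is below; `index_digest` is a hypothetical helper, and the canonical JSON encoding (sorted keys, compact separators, one row per line) is an assumption rather than the encoding the tooling necessarily uses.

```python
import hashlib
import json


def index_digest(rows):
    """SHA-256 over rows sorted by item_id (the canonical encoding is an assumption)."""
    digest = hashlib.sha256()
    for row in sorted(rows, key=lambda r: r["item_id"]):
        # sort_keys and fixed separators keep the encoded bytes stable across runs.
        encoded = json.dumps(row, sort_keys=True, separators=(",", ":"))
        digest.update(encoded.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
```

With this ordering, two index files that contain the same rows in a different order yield the same digest.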
Split And Contamination Controls
Frozen split generation (split_artifacts_cli) uses:
- Repo holdout by deterministic hash of (seed, repo_remote).
- Exact contamination check by goal_sha256.
- Near-duplicate contamination check by token Jaccard on normalized goal_text.
- Additional near-duplicate signal by character n-gram Jaccard.
- Automatic drop of contaminated heldout rows with explicit accounting:
  - dropped_test_item_ids
  - leak_fraction
  - overlap pair listings
- Release controls:
  - license_policy: any or open_only
  - release_visibility: full or public
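The repo holdout and the two Jaccard signals listed above can be sketched as follows. This is an illustration under stated assumptions, not the behavior of split_artifacts_cli: the hash construction, the normalization (lowercasing and whitespace collapsing), whitespace tokenization, and the n-gram size of 3 are all placeholders, and every function name here is hypothetical.

```python
import hashlib


def is_heldout(seed: int, repo_remote: str, holdout_fraction: float) -> bool:
    """Assign a whole repo to the heldout side from a hash of (seed, repo_remote)."""
    key = f"{seed}:{repo_remote}".encode("utf-8")
    # Compare the first 8 digest bytes against an integer threshold in [0, 2**64].
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return bucket < int(holdout_fraction * 2**64)


def normalize(text: str) -> str:
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def jaccard(a: set, b: set) -> float:
    """|a & b| / |a | b|, with two empty sets treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def token_jaccard(x: str, y: str) -> float:
    """Jaccard similarity over whitespace tokens of the normalized text."""
    return jaccard(set(normalize(x).split()), set(normalize(y).split()))


def char_ngram_jaccard(x: str, y: str, n: int = 3) -> float:
    """Jaccard similarity over character n-grams of the normalized text."""
    def grams(s: str) -> set:
        s = normalize(s)
        # Strings shorter than n contribute a single (short) gram.
        return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}
    return jaccard(grams(x), grams(y))
```

Because the holdout decision depends only on the seed and repo_remote, every row from a given repository lands on the same side of the split, and rerunning with the same seed reproduces the assignment. A heldout row would then be flagged when either similarity against a dev row meets its configured threshold.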
Release Pinning Requirements
For a releasable data/baseline bundle, pin:
- public_dev.jsonl
- heldout_test.jsonl (for release_visibility=full only)
- heldout_test_commitments.json (required for both full and public)
- split_manifest.json
- contamination_report.json
- artifact_checksums.json
- SHA-256 checksums for each pinned file
- annotated git tags for the split release and baseline run release
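Computing the per-file SHA-256 checksums for the pinned files might look like the sketch below. The flat filename-to-digest mapping written to artifact_checksums.json is an assumed layout, and `sha256_file` and `write_checksums` are hypothetical helpers, not part of the project's tooling.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large JSONL files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksums(bundle_dir: Path, names, out_name: str = "artifact_checksums.json") -> dict:
    """Write a {filename: sha256} mapping (assumed layout) for the pinned files."""
    checksums = {name: sha256_file(bundle_dir / name) for name in names}
    text = json.dumps(checksums, indent=2, sort_keys=True) + "\n"
    (bundle_dir / out_name).write_text(text, encoding="utf-8")
    return checksums
```

For a public release, the list passed as `names` would omit heldout_test.jsonl, since that file is pinned only for release_visibility=full.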
split_manifest.json and contamination_report.json must include:
- source index_sha256
- split config (seed, repo_holdout_fraction, near_dup_jaccard_threshold, char_ngram_jaccard_threshold, license_policy, max_leak_fraction)
- release visibility metadata (full vs public)
For public distribution, prefer license_policy=open_only and release_visibility=public.
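A pre-release check for the required fields of split_manifest.json and contamination_report.json could be sketched as below. The key layout is an assumption made for illustration: top-level "index_sha256" and "release_visibility", with the split config nested under a hypothetical "config" key. The real files may organize these fields differently.

```python
REQUIRED_CONFIG_KEYS = (
    "seed", "repo_holdout_fraction", "near_dup_jaccard_threshold",
    "char_ngram_jaccard_threshold", "license_policy", "max_leak_fraction",
)


def missing_manifest_fields(doc: dict) -> list:
    """Return dotted names of required fields absent from a manifest or report dict.

    Assumed layout: top-level "index_sha256" and "release_visibility", with the
    split config nested under "config".
    """
    missing = [key for key in ("index_sha256", "release_visibility") if key not in doc]
    config = doc.get("config", {})
    missing += [f"config.{key}" for key in REQUIRED_CONFIG_KEYS if key not in config]
    return missing
```

An empty return value means the document carries every field this card requires; anything else lists what to fix before tagging the release.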