Benchmark Card (Draft): goal_to_tactic_proposal_v1
Identity
- Addendum:
addenda/lean-sorry-repos-benchmark - Entrypoint:
python -m lean_sorry_repos_benchmark - Schema version:
1 - Task: given a Lean goal state, propose exactly one next tactic line.
Scope
- Input unit: one row from
index.jsonl. - Output unit: one attempt record per
(item_id, sample_index). - Adapters:
mock(deterministic local baseline),ollama(local inference endpoint),openai(OpenAI-compatible chat completions endpoint). - Verification modes:
none,synthetic,repo_replay.
Protocol
- Row loading validates required fields and fails on malformed rows.
- Selection is deterministic from
split_policy,goal_slice,seed, and optionalmax_items. - Multi-sample runs use deterministic per-attempt seeds from
(seed, item_id, sample_index). - Verification only runs when generation had no error and tactic is non-empty and does not contain
sorry/admit. repo_replaycan load per-repo profile config (schema_version: 1) to override replay policy byrepo_remoteand/orrepo_lean_version.
Metrics And Outputs
- Proposal metrics (
summary.json):valid_rate,nonempty_rate,contains_sorry_rate- latency (
latency_ms_mean,latency_ms_p50,latency_ms_p95) - generation error taxonomy (
generation_error_kinds,generation_error_domains)
- Verification metrics (
summary.json):- attempted/success/error rates and latency
verification_pass_at_k_*using item-level success aggregation- verification error taxonomy (
verification_error_kinds,verification_error_domains)
- Output files:
attempts.jsonlsummary.json
Reproducibility Invariants
selected_rows_hashpins the selected item set.- Optional deterministic sharding (
--shard-count,--shard-index) partitions selected items by stableitem_idhashing. - Replay profile provenance is explicit in artifacts (
profile_config_path,profile_match_counts, and per-attemptrepo_replay_profile_id). - Pass@k values are bounded to
samples_per_item. - Frozen split generation writes
split_manifest.jsonandcontamination_report.jsonwith source index SHA-256 and split config.
Known Limits
- Tactic extraction keeps only the first tactic line from model output.
syntheticverification checks reconstructed scripts fromgoal_text; it is not equivalent to repo-context replay.repo_replayverifies patched-file Lean execution atrepo@commit; it is first-pass and not full-project CI orchestration.