Public Benchmark Notes

Current findings as of 2026-03-15 from the addendum pipeline.

Verified Ground Truth Surface

  • Public set size: 1200
  • Public label agreement with graph.json: 1200/1200
  • Public statuses are proof-only: every public status byte is in {2, 3, 6, 7}
  • Public status mix:
    • explicit_proof_false: 34
    • explicit_proof_true: 1
    • implicit_proof_false: 592
    • implicit_proof_true: 573

The public benchmark is therefore not just adjacent to the Equational Theories implication graph. It is exactly recoverable from that graph.

Small Countermodel Coverage

Greedy cover over all 2-element magmas covers 454/626 public false problems with six operations:

  • constant_0 (0000): 172
  • right_projection (0101): 126
  • left_projection (0011): 98
  • xor (0110): 45
  • not_left (1100): 8
  • not_right (1010): 5

This is the strongest current false-side signal.

If we shift from the old greedy witness battery to the full fixed 10-theory 2-element pair-evaluator basis, false coverage improves to 467/626. So the full representative basis adds 13 public false decisions beyond the old six-table battery and is the better offline evaluator surface.

Brute-force search over all 19683 binary operations on the 3-element carrier adds another 84 false problems from the 172 missed by the 2-element battery, leaving 88 still unexplained.

Most frequent 3-element witness tables so far:

  • 000000001: 10
  • 000002222: 8
  • 000000010: 6
  • 021210102: 6

An explicit SAT search for 4-element countermodels, with a 1000 ms timeout per residual problem, adds another 51 false problems from the 88 missed after the 3-element pass. That leaves only 37 public false problems unexplained.

Unique-pair status split at size 4:

  • 20 sat
  • 1 unknown
  • 16 unsat

An explicit SAT search for 5-element countermodels, again with a 1000 ms timeout per residual problem, adds another 16 false problems from the 37 missed after the 4-element pass. That leaves only 21 public false problems unexplained.

Unique-pair status split at size 5:

  • 7 sat
  • 3 unsat
  • 7 unknown

Combined false-side coverage:

  • 2-element battery: 454/626
  • 3-element exhaustive search: 84/172 residual
  • 4-element SAT search: 51/88 residual
  • 5-element SAT search: 16/37 residual
  • total explained false cases: 605/626 (96.6%)

Pushing the remaining 21 cases to a 3000 ms timeout at size 5 did not add any new witnesses, so the current bottleneck is not just per-instance timeout.

The expensive residual searches are now deduped by exact (equation1_id, equation2_id) pair:

  • the 172 false cases missed by the 2-element battery collapse to 100 unique pairs over 93 unique sources
  • the 88 cases missed after the 3-element pass collapse to 37 unique pairs over 33 unique sources
  • the 37 cases missed after the 4-element pass collapse to 17 unique pairs over 16 unique sources
  • the final 21 residual false cases are only 10 unique pairs over 9 unique sources

Canonicalizing the 7 size-5 unknown residual pairs to the smallest mutually implied source and target laws already helps:

  • 1286 -> 2891 reduces to 1286 -> 2079
  • 2656 -> 2863 reduces to 2055 -> 2863
  • 2805 -> 3891 reduces to 1587 -> 3891
  • 2994 -> 2623 reduces to 2994 -> 242
  • rerunning size-5 SAT on those canonical pairs changes 1 case from unknown to unsat
  • the canonical followup status split is now 6 unknown / 1 unsat

Full-Graph Source-Law Invariants

Across the full 4694-law universe:

  • laws of the form x = t where x is absent from t: 816/816 imply every target law
  • laws of the form x = t where every occurrence of x is on a mixed path, some other variable appears exactly once in t, and the x-path set does not contain both exact LR and RL: 680/680 imply every target law
  • laws of the form x = t where the leftmost leaf of t is still x: 0/816 imply every target law
  • laws of the form x = t where x occurs exactly once and is not leftmost: 504/924 imply every target law

The first two lines are the cleanest true-side invariants discovered so far.

TRUE-Side Reality Check

The theorem-backed TRUE side is materially better than it was:

  • collapse rule: 243/574 public true problems
  • mixed-self-reference-with-singleton rule: 188/574
  • direct substitution-instance rule: 5/574
  • one-hole context closure over substitution: 6/574
  • combined theorem-backed cover: 433/574 (75.4%)

The remaining theorem-uncovered true region is down to 141 public problems.

Exact Source-Row Semantics

The biggest current shift is that the dominant middle is source-row semantics, not another thin source feature family.

Over the 10 distinct 2-element theory classes induced by the 16 binary operations:

  • 2407/4694 source laws have a full implication row that is exactly the intersection of the 2-element theories satisfying that source
  • this is exact for 669/1200 public problems
  • the exact public split is 558 true and 111 false
  • among the 141 theorem-uncovered public true cases, the exact source-row classifier explains 126

The highest-yield exact source fingerprints on the public set are:

  • constant: 92 public problems, 48 true, 47 from the current theorem-uncovered true tail
  • right_projection: 39 public problems, 30 true, 30 from the current theorem-uncovered true tail
  • left_projection: 36 public problems, 29 true, 29 from the current theorem-uncovered true tail
  • not_right + right_projection: 32 public problems, 13 true, 13 from the current theorem-uncovered true tail
  • left_projection + not_left: 18 public problems, 6 true, 6 from the current theorem-uncovered true tail
  • empty fingerprint is real but not automatically TRUE; it only works when paired with a proved collapse theorem

Fingerprint Collision Reality Check

The raw 2-element source fingerprint is helpful, but it is not close to a unique identifier for the full implication row:

  • there are 182 distinct fingerprint buckets over the full 4694-law universe
  • only 51 buckets have a single row signature
  • 131 buckets are internally divergent
  • only 51 laws live in single-row buckets; the other 4643 laws live in divergent buckets

The worst collisions are exactly where the current residual mass lives:

  • empty fingerprint bucket: 1558 laws split across 1125 distinct rows
  • constant: 671 laws split across 579 rows
  • left_projection: 220 laws split across 199 rows
  • right_projection: 220 laws split across 195 rows
  • xor_xnor: 182 laws split across 179 rows

So the current lesson is precise:

  • exact source-row semantics is real and high-yield
  • raw fingerprint labels alone are too coarse to be the final evaluator
  • the remaining Stage 1 work is mostly about the non-exact collision region

Kernel Proof Layer

The next true-side move is now much clearer than it was before the collision analysis.

If we exclude the vacuous case where the “kernel” is just the target law itself, and restrict to intermediate laws with operation count <= 4, then a nontrivial kernel bridge already explains:

  • 13/15 current residual public true problems
  • 10/12 current residual unique true pairs

That raises the teacher-side candidate surface from 1164/1200 to 1177/1200, leaving only:

  • 2 unresolved public true cases
  • 21 unresolved public false cases

The direct-kernel misses were:

  • 711 -> 3567: x = y * (y * ((x * z) * z)) does imply x * y = y * ((z * x) * z), but not yet through the current nontrivial kernel library
  • 2170 -> 4640: x = ((y * z) * x) * (z * y) does imply (x * y) * x = (y * z) * z, but again not yet through the current kernel layer

Those two gaps are now closed by a tiny 2-kernel micro-rewrite layer:

  • 711 -> 3567 closes through a short base kernel like 3553 plus helper 43 (x * y = y * x)
  • 2170 -> 4640 closes through a short base kernel like 4413 or 4450 plus the same helper 43
  • the micro-rewrite layer covers 2/2 residual true cases after the direct-kernel pass
  • the teacher-side candidate surface therefore moves from 1177/1200 to 1179/1200

So the true-side residual is no longer open on the public set. The remaining public tail is now purely false-side.

Residual Shape

After theorem-backed TRUE families, exact source rows, and the explicit false witness stack, the exact public-decision tail is still 36 problems:

  • 15 unresolved true
  • 21 unresolved false

The unresolved true cases before the kernel proof layer were concentrated in a few source fingerprints:

  • xor_xnor: 7
  • right_projection: 3
  • constant: 3
  • not_left_and_right_left_implies_right + right_projection: 1
  • left_projection + not_left: 1

The unresolved false cases are even more concentrated:

  • empty fingerprint: 9
  • xor_xnor: 5
  • right_projection: 4
  • left_projection: 3

By unique pair, the false tail is only 10 source-target pairs over 9 unique source laws.

Combined Public Decision Surface

If we stack:

  • theorem-backed TRUE families
  • exact 2-element source-row semantics
  • explicit false witnesses from the finite-model search stack

then the current public decision surface is:

  • 1164/1200 exact decisions
  • 15 unresolved public true cases
  • 21 unresolved public false cases

If we then add the current nontrivial kernel-bridge candidate layer, the teacher-side surface moves to:

  • 1177/1200 candidate decisions
  • 2 unresolved public true cases
  • 21 unresolved public false cases

If we add the current 2-kernel micro-rewrite layer on top of the direct-kernel pass, the teacher-side candidate surface moves again to:

  • 1179/1200 candidate decisions
  • 0 unresolved public true cases
  • 21 unresolved public false cases

Public-Only Candidate Rules

These are public-set patterns only. They are useful for triage, but they are not theorem-backed yet and should be treated as hypotheses:

  • non-collapse TRUE candidate: source_more_vars + target_more_ops + target_two_vars is 17/17
  • non-collapse FALSE candidate: source_more_repetition + source_two_vars is 0/116
  • non-collapse FALSE candidate: same_operation_count + source_two_vars is 0/115

Full-graph validation says these public-perfect rules do not generalize cleanly:

  • source_more_vars + target_more_ops + target_two_vars drops to 0.3999 true-rate on the full graph
  • source_more_repetition + source_two_vars stays extremely false-biased at 0.0026 true-rate
  • same_operation_count + source_two_vars stays extremely false-biased at 0.0033 true-rate

Current Submission Candidate

See cheatsheet-v0.txt.