Public Benchmark Notes
Current findings as of 2026-03-15 from the addendum pipeline.
Use docs/approach.md for the method overview. This file is the detailed metric/reference companion.
Public Labels Match The Graph
- Public set size:
1200 - Public label agreement with
graph.json:1200/1200 - Public statuses are proof-only: every public status byte is in
{2, 3, 6, 7} - Public status mix:
explicit_proof_false:34explicit_proof_true:1implicit_proof_false:592implicit_proof_true:573
The public benchmark is not just adjacent to the Equational Theories implication graph. It is exactly recoverable from that graph.
Small Countermodels Explain Most FALSE Cases
Greedy cover over all 2-element magmas covers 454/626 public false problems with six operations:
constant_0(0000):172right_projection(0101):126left_projection(0011):98xor(0110):45not_left(1100):8not_right(1010):5
This is the strongest compact false-side signal.
If we shift from the old greedy witness battery to the full fixed 10-theory 2-element pair-evaluator basis, false coverage improves to 467/626. So the full representative basis adds 13 public false decisions beyond the old six-table battery and is the better offline evaluator surface.
Brute-force search over all 19683 binary operations on the 3-element carrier adds another 84 false problems from the 172 missed by the 2-element battery, leaving 88 still unexplained.
Most frequent 3-element witness tables so far:
000000001:10000002222:8000000010:6021210102:6
An explicit SAT search for 4-element countermodels, with a 1000 ms timeout per residual problem, adds another 51 false problems from the 88 missed after the 3-element pass. That leaves only 37 public false problems unexplained.
Unique-pair status split at size 4:
20sat16unsat1unknown
An explicit SAT search for 5-element countermodels, again with a 1000 ms timeout per residual problem, adds another 16 false problems from the 37 missed after the 4-element pass. That leaves only 21 public false problems unexplained.
Unique-pair status split at size 5:
7sat3unsat7unknown
Combined false-side coverage:
2-element battery:454/6263-element exhaustive search:84/172residual4-element SAT search:51/88residual5-element SAT search:16/37residual- total explained false cases:
605/626(96.6%)
Pushing the remaining 21 cases to a 3000 ms timeout at size 5 did not add any new witnesses, so the current bottleneck is not just per-instance timeout.
The expensive residual searches are now deduped by exact (equation1_id, equation2_id) pair:
- the
172false cases missed by the2-element battery collapse to100unique pairs over93unique sources - the
88cases missed after the3-element pass collapse to37unique pairs over33unique sources - the
37cases missed after the4-element pass collapse to17unique pairs over16unique sources - the final
21residual false cases are only10unique pairs over9unique sources
Canonicalizing the 7 size-5 unknown residual pairs to the smallest mutually implied source and target laws already helps, but less than the earlier unstable rerun suggested:
1286 -> 2891reduces to1286 -> 20792656 -> 2863reduces to2055 -> 28632805 -> 3891reduces to1587 -> 38912994 -> 2623reduces to2994 -> 242- rerunning size-
5SAT on those canonical pairs changes1case fromunknowntounsat - the canonical followup status split is now
6 unknown / 1 unsat
Two Full-Graph Source Families Give The Main TRUE Rules
Across the full 4694-law universe:
- laws of the form
x = twherexis absent fromt:816/816imply every target law - laws of the form
x = twhere every occurrence ofxis on a mixed path, some other variable appears exactly once int, and the x-path set does not contain both exactLRandRL:680/680imply every target law - laws of the form
x = twhere the leftmost leaf oftis stillx:0/816imply every target law - laws of the form
x = twherexoccurs exactly once and is not leftmost:504/924imply every target law
The first two lines are the cleanest true-side invariants in the current stack.
Theorem Rules Cover Most Public TRUE Cases
The theorem-backed TRUE side carries most of the public TRUE cases:
- collapse rule:
243/574public true problems - mixed-self-reference-with-singleton rule:
188/574 - direct substitution-instance rule:
5/574 - one-hole context closure over substitution:
6/574 - combined theorem-backed cover:
433/574(75.4%)
The remaining theorem-uncovered true region is 141 public problems.
Source-Row Semantics Explain Most Of The Remaining TRUE Mass
The main shift in this project is that the dominant middle is source-row semantics, not another thin source feature family.
Over the 10 distinct 2-element theory classes induced by the 16 binary operations:
2407/4694source laws have a full implication row that is exactly the intersection of the2-element theories satisfying that source- this is exact for
669/1200public problems - the exact public split is
558true and111false - among the
141theorem-uncovered public true cases, the exact source-row classifier explains126
The highest-yield exact source fingerprints on the public set are:
constant:92public problems,48true,47from the current theorem-uncovered true tailright_projection:39public problems,30true,30from the current theorem-uncovered true tailleft_projection:36public problems,29true,29from the current theorem-uncovered true tailnot_right + right_projection:32public problems,13true,13from the current theorem-uncovered true tailleft_projection + not_left:18public problems,6true,6from the current theorem-uncovered true tail- empty fingerprint is real but not automatically TRUE; it only works when paired with a proved collapse theorem
Raw Fingerprints Collide Too Much To Ship Directly
The raw 2-element source fingerprint is helpful, but it is not close to a unique identifier for the full implication row:
- there are
182distinct fingerprint buckets over the full4694-law universe - only
51buckets have a single row signature 131buckets are internally divergent- only
51laws live in single-row buckets; the other4643laws live in divergent buckets
The worst collisions are exactly where the current residual mass lives:
- empty fingerprint bucket:
1558laws split across1125distinct rows constant:671laws split across579rowsleft_projection:220laws split across199rowsright_projection:220laws split across195rowsxor_xnor:182laws split across179rows
The lesson is straightforward:
- exact source-row semantics is real and high-yield
- raw fingerprint labels alone are too coarse to be the final evaluator
- the remaining Stage 1 work is mostly about the non-exact collision region
The Proof Layer Is Now Explicit
The proof layer is now explicitly cataloged instead of graph-mined at review time.
After the exact public surface 1164/1200, the shipped constructive extension is:
- an exact source-triggered kernel catalog with
10explicit source patterns - two exact source-triggered commutativity repairs
The kernel catalog explains:
13/15current residual public true problems10/12current residual unique true pairs
That raises the constructive candidate surface from 1164/1200 to 1177/1200, leaving only:
2unresolved public true cases21unresolved public false cases
The two remaining true cases are closed by two explicit commutativity repairs:
711 -> 3567:- source trigger:
x = y * (y * ((x * z) * z)) - base kernel:
x * y = y * ((x * z) * z) - helper:
x * y = y * x - one local flip yields
x * y = y * ((z * x) * z)
- source trigger:
2170 -> 4640:- source trigger:
x = ((y * z) * x) * (z * y) - base kernel:
x * (x * y) = (y * z) * z - helper:
x * y = y * x - one local flip yields
(x * y) * x = (y * z) * z
- source trigger:
- the commutativity repairs cover
2/2residual true cases after the direct-kernel catalog - the constructive candidate surface therefore moves from
1177/1200to1179/1200
So the public true-side residual is closed by a finite constructive catalog. The remaining public tail is purely false-side.
The Remaining Tail Is Small And Structured
After theorem-backed TRUE families, exact source rows, and the explicit false witness stack, the exact public-decision tail is still 36 problems:
15unresolved true21unresolved false
The unresolved true cases before the kernel proof layer were concentrated in a few source fingerprints:
xor_xnor:7right_projection:3constant:3not_left_and_right_left_implies_right + right_projection:1left_projection + not_left:1
The unresolved false cases are even more concentrated:
- empty fingerprint:
9 xor_xnor:5right_projection:4left_projection:3
By unique pair, the false tail is only 10 source-target pairs over 9 unique source laws.
What The Full Stack Decides On The Public Set
If we stack:
- theorem-backed TRUE families
- exact
2-element source-row semantics - explicit false witnesses from the finite-model search stack
then the current public decision surface is:
1164/1200exact decisions15unresolved public true cases21unresolved public false cases
If we then add the current explicit source-triggered kernel catalog, the constructive surface moves to:
1177/1200candidate decisions2unresolved public true cases21unresolved public false cases
If we add the two explicit commutativity repairs on top of that catalog, the constructive candidate surface moves again to:
1179/1200candidate decisions0unresolved public true cases21unresolved public false cases
Public-Only Candidate Rules Still Stay Hypotheses
These are public-set patterns only. They are useful for triage, but they are not theorem-backed and should stay hypotheses:
- non-collapse TRUE candidate:
source_more_vars + target_more_ops + target_two_varsis17/17 - non-collapse FALSE candidate:
source_more_repetition + source_two_varsis0/116 - non-collapse FALSE candidate:
same_operation_count + source_two_varsis0/115
Full-graph validation says these public-perfect rules do not generalize cleanly:
source_more_vars + target_more_ops + target_two_varsdrops to0.3999true-rate on the full graphsource_more_repetition + source_two_varsstays extremely false-biased at0.0026true-ratesame_operation_count + source_two_varsstays extremely false-biased at0.0033true-rate
Frozen Submission Artifact
See cheatsheet-v0.txt.