5.4 Pro Review Packet

This packet is the minimal handoff for reviewing the current state of the equational-theories-distillation addendum.

Objective

Build the strongest possible plain-text cheatsheet for the SAIR Equational Theories Stage 1 competition.

The operating assumption is:

inference-time model gets no tools
submission is a plain-text cheatsheet under 10 KB
the public benchmark is a teacher surface, not the final target
the real goal is a compact decision procedure that generalizes beyond the public split

Ground Truth Surface

The addendum pins four source assets:

public normal.jsonl
public hard.jsonl
equations.txt
graph.json

The public benchmark maps exactly into the fixed 4694-law implication graph:

public label agreement with graph.json: 1200/1200

So the graph is the offline teacher surface.

Current System

Current pipeline:

map every public problem into the 4694-law universe
mine theorem-backed always-true source-law families from the full graph
evaluate each pair against a fixed 10-table 2-element basis
mine exact source-row semantics over the same 2-element theory classes
mine false implications by explicit finite countermodels
mine nontrivial short kernel bridges for the true-side residual
extend that into a tiny 2-kernel micro-rewrite layer
distill the surviving rules into a compact cheatsheet draft

False-side search stack:

fixed 10-table 2-element pair evaluator
greedy 6-operation 2-element witness battery
exhaustive 3-element brute force
residual SAT search at size 4
residual SAT search at size 5

True-side stack:

collapse family detection
mixed-self-reference-with-singleton family detection
direct substitution-instance matcher
one-hole contextual-instance matcher
nontrivial short-kernel bridge search
exact source-row semantics under the 2-element theory classes

Strongest Results

True side:

collapse family: source law x = t with x absent from t
- full graph: 816/816 imply every target law
- public true coverage: 243/574
mixed-self-reference-with-singleton family:
- source law has form x = t
- every occurrence of x in t lies on a mixed path whose root-to-leaf trace uses both L and R
- some other variable appears exactly once in t
- the x-path set does not contain both exact LR and RL
- full graph: 680/680 imply every target law
- public true coverage: 188/574
one-hole context closure: 6/574
combined theorem-backed true coverage: 433/574 (75.44%)
remaining theorem-uncovered public true cases: 141

Source-row semantics:

there are 10 distinct 2-element theory classes induced by the 16 binary operations
for 2407/4694 source laws, the full implication row is exactly the intersection of the satisfying 2-element theories
on the public set this is exact for 669/1200 problems
public exact split: 558 true, 111 false
among the 141 theorem-uncovered public true cases, exact source-row semantics explains 126
dominant exact public source fingerprints:
- constant: 92 problems, 48 true
- right_projection: 39 problems, 30 true
- left_projection: 36 problems, 29 true
- not_right + right_projection: 32 problems, 13 true
- left_projection + not_left: 18 problems, 6 true

Pair evaluator:

a fixed 10-representative 2-element basis separates 467/626 public false problems
this is 13 better than the old greedy six-table battery
the current cheatsheet draft now uses this full basis instead of source-bucket lookup as the main offline evaluator

Collision analysis:

there are 182 raw source-fingerprint buckets
only 51 are single-row buckets
131 buckets are internally divergent
only 51 laws live in single-row buckets; the other 4643 laws live in divergent buckets
worst collision buckets:
- empty fingerprint: 1558 laws, 1125 row signatures
- constant: 671 laws, 579 row signatures
- left_projection: 220 laws, 199 row signatures
- right_projection: 220 laws, 195 row signatures
- xor_xnor: 182 laws, 179 row signatures

False side:

10-table pair evaluator: 467/626
greedy 2-element battery: 454/626
exhaustive 3-element search: +84
size-4 SAT at 1000 ms: +51
size-5 SAT at 1000 ms: +16
total explained false cases: 605/626 (96.65%)
remaining unexplained public false cases: 21
increasing the unresolved size-5 cases from 1000 ms to 3000 ms added 0 new witnesses
residual size-4 pair statuses: 20 sat / 1 unknown / 16 unsat
residual size-5 pair statuses: 7 sat / 3 unsat / 7 unknown
residual search is now deduped by exact source-target pair
- the final 21 false cases are only 10 unique pairs over 9 unique sources

Kernel proof layer:

exclude the vacuous case where the kernel is just the target law itself
restrict kernels to operation count <= 4
nontrivial kernel bridges explain 13/15 current residual true problems
that covers 10/12 residual true unique pairs
teacher-side candidate surface becomes 1177/1200
a 2-kernel micro-rewrite layer closes both remaining true pairs:
- 711 -> 3567
- 2170 -> 4640
teacher-side candidate surface after the micro-rewrite layer: 1179/1200
remaining candidate tail after the proof layers: 21 false and 0 true

Combined public decision surface:

theorem-backed TRUE + exact source-row semantics + explicit false witnesses = 1164/1200
unresolved exact-decision tail: 15 true and 21 false
unresolved true fingerprints:
- xor_xnor: 7
- right_projection: 3
- constant: 3
- not_left_and_right_left_implies_right + right_projection: 1
- left_projection + not_left: 1
unresolved false fingerprints:
- empty fingerprint: 9
- xor_xnor: 5
- right_projection: 4
- left_projection: 3

False-tail followup:

size-5 residual statuses are 7 sat / 3 unsat / 7 unknown
canonicalizing the 7 size-5 unknown pairs to the smallest mutually implied source and target laws changes 1 of them from unknown to unsat
canonical followup statuses: 6 unknown / 1 unsat

Current Direction

The addendum is now centered on a compact pair evaluator:

theorem-backed all-target TRUE families first
then the fixed 10-table 2-element separator basis
then direct substitution/context checks
then a short-kernel proof layer with a tiny 2-kernel micro-rewrite extension
false-side search quality matters more than bigger timeout because the size-5 tail now has explicit unknown cases
source-row semantics remains the teacher-side lens, but raw fingerprint buckets are no longer treated as the final offline abstraction

Review Questions

Please focus on these:

Is the new pair-evaluator framing actually the right Stage 1 abstraction, or is there a stronger compact evaluator to distill?
Is the current short-kernel + 2-kernel micro-rewrite layer the right shallow proof-search mechanism, or should it be replaced by something cleaner?
What is the best next false-side move for the residual 21 cases given the 7 unknown size-5 statuses: better SAT encoding, symmetry breaking, a different solver, or a different small semantic basis?
Is canonicalization before model search enough for the current unknown tail, or should the solver surface itself change next?
Is the current cheatsheet draft actually aligned with the competition objective as an offline decision procedure, or is there a better distillation target?

Key Files

Start here:

README.md
docs/public-benchmark-notes.md
docs/cheatsheet-v0.txt
equational_theories_distillation/source_row_semantics.py
equational_theories_distillation/source_patterns.py
equational_theories_distillation/analysis.py
equational_theories_distillation/cheatsheet.py
tmp/equational-theories-distillation/analysis/public-analysis.json

Repro

cd addenda/equational-theories-distillation
uv sync
uv run ruff check .
uv run ty check
uv run python -m pytest
uv run python -m equational_theories_distillation analyze \
  --size4-sat-timeout-ms 1000 \
  --size5-sat-timeout-ms 1000
uv run python -m equational_theories_distillation draft-cheatsheet