External Review Notes

Short notes from the local Claude CLI and Gemini CLI review passes.

2026-03-15: Micro-Rewrite + Canonical False-Tail Pass

State reviewed:

exact public decision surface stayed 1164/1200
direct kernel bridge candidate stayed 1177/1200
the new 2-kernel micro-rewrite layer pushed the teacher-side candidate surface to 1179/1200
the remaining candidate tail is now 21 false and 0 true
canonicalizing the 7 size-5 unknown false pairs changed 1 to unsat, leaving 6 canonical unknowns

Claude summary:

no usable response was returned on this pass
I tried both the wrapper-based full prompt and a very short direct claude --print prompt
both were invoked subscription-only with API-key env vars unset
the local CLI stayed live with no text output from the non-interactive process

Gemini summary:

strongest part: the evaluator now gets high coverage from a compact stack instead of from one brittle trick
biggest risk: the remaining 6 canonical size-5 unknowns may force a much more expensive false-side search if the current solver surface does not improve
suggested next move: target those 6 canonical unknowns with better constrained SAT/SMT or targeted enumeration
judgment on the 2-kernel micro-rewrite layer: it looks like a compact offline abstraction, not just a teacher-only patch, because it yields short modular proof traces instead of opaque labels
this response came only after Gemini first hit a 429 capacity retry on gemini-3.1-pro-preview

State reviewed:

Claude summary:

no usable response was returned on this pass
I tried both a full packet prompt and a much shorter delta-only prompt
both were invoked subscription-only with API-key env vars unset
both runs stayed live for minutes with no text output from the non-interactive CLI wrapper

Gemini summary:

no usable response was returned on this pass
I tried both a full packet prompt and a much shorter delta-only prompt
both were invoked subscription-only with API-key env vars unset
both runs stayed live for minutes with no text output from the non-interactive CLI wrapper

Operational note:

this failure mode looks different from the earlier 429 capacity errors
on this pass the wrappers launched the CLIs successfully, but neither call produced a usable review before timing became unreasonable

State reviewed:

Claude summary:

verdict: this is a serious Stage 1 attempt, not just a public-fit artifact
strongest parts: clean theorem-family separation, source-row semantics as the right middle abstraction, disciplined false-side escalation, and duplicate-aware residual search
main risks: the remaining 15 true cases may expose a real blind spot in the fingerprint scheme, and the remaining 21 false cases may need something other than larger SAT instances
highest-leverage next moves:
- structurally characterize the remaining 15 true cases
- attack the 10 unique residual false pairs with rewrite/divergence arguments rather than bigger models
- run a collision analysis on the 2-element fingerprint classes to see where the current cheatsheet abstraction genuinely loses information
code critique: the module split looks right, but source_row_semantics.py now deserves its own regression tests because it is central to the whole approach
cheatsheet critique: it is closer to a real decision procedure, but it still needs a more explicit definition of the fingerprint evaluation protocol and a cleaner separation between the evaluator and the lookup buckets

Gemini summary:

no usable response was returned from the Gemini CLI on this updated review pass
I retried three times, all subscription-only and with API-key env vars unset
one run stalled on the file-reading prompt, and a later summary-only retry also failed to return within the allotted wait window
earlier in this project Gemini also hit repeated server-side 429 capacity failures on gemini-3.1-pro-preview, so the current issue still looks like CLI/service instability rather than a local API-key path

The strongest outside strategic review so far came from the 5.4 Pro handoff packet.

Main takeaways:

the core direction is viable, but the winning abstraction is source-row semantics, not a collection of pair-local heuristics
the broader mixed-self-reference family is exact on the full graph and was the right theorem jump to encode
exact source-row semantics over tiny 2-element theories explains most of the remaining true mass after the theorem families
the false tail should be attacked as unique source-target pairs, not as repeated benchmark instances
the cheatsheet should converge toward a tiny symbolic evaluator rather than a prose memo

State reviewed:

Claude summary:

strongest result: the new source-law family was the real win
biggest caveat: 188 true and 37 false cases still unresolved
suggested next move: either formalize that family or push the false tail with larger finite-model search

Gemini summary:

strongest result: the new source-law family was the cleanest addition
biggest caveat: current small-model search and true-side heuristics hit a ceiling
suggested next move: formally prove the new family, then change the search strategy for the remaining false cases

State reviewed:

Claude summary:

strongest result: false side is now nearly closed
biggest caveat: diminishing returns are obvious on the false tail and the true residual is much larger
suggested next move: shift focus toward the true-side residual

Gemini summary:

strongest result: near-complete false-side coverage with exact public-label agreement
biggest caveat: size-5 timeouts plateaued; more budget at the same setup is not enough
suggested next move: change the structural search strategy for the remaining 21 false cases instead of just increasing timeout

These review runs were done subscription-only, with the common API-key environment variables explicitly unset before invoking the CLIs.

Gemini repeatedly hit server-side 429 capacity errors on gemini-3.1-pro-preview before returning.