External Review Notes
Short notes from the local Claude CLI and Gemini CLI review passes.
2026-03-15: Micro-Rewrite + Canonical False-Tail Pass
State reviewed:
- exact public decision surface stayed
1164/1200 - direct kernel bridge candidate stayed
1177/1200 - the new
2-kernel micro-rewrite layer pushed the teacher-side candidate surface to1179/1200 - the remaining candidate tail is now
21false and0true - canonicalizing the
7size-5unknown false pairs changed1tounsat, leaving6canonical unknowns
Claude summary:
- no usable response was returned on this pass
- I tried both the wrapper-based full prompt and a very short direct
claude --printprompt - both were invoked subscription-only with API-key env vars unset
- the local CLI stayed live with no text output from the non-interactive process
Gemini summary:
- strongest part: the evaluator now gets high coverage from a compact stack instead of from one brittle trick
- biggest risk: the remaining
6canonical size-5unknowns may force a much more expensive false-side search if the current solver surface does not improve - suggested next move: target those
6canonical unknowns with better constrained SAT/SMT or targeted enumeration - judgment on the
2-kernel micro-rewrite layer: it looks like a compact offline abstraction, not just a teacher-only patch, because it yields short modular proof traces instead of opaque labels - this response came only after Gemini first hit a
429capacity retry ongemini-3.1-pro-preview
2026-03-15: Pair Evaluator + Honest Kernel Bridge Pass
State reviewed:
- theorem-backed true coverage stayed
433/574 - fixed
10-table pair evaluator reached467/626public false - combined exact public surface stayed
1164/1200 - nontrivial kernel bridge candidate reached
1177/1200 - residuals under that candidate surface are now
2true and21false
Claude summary:
- no usable response was returned on this pass
- I tried both a full packet prompt and a much shorter delta-only prompt
- both were invoked subscription-only with API-key env vars unset
- both runs stayed live for minutes with no text output from the non-interactive CLI wrapper
Gemini summary:
- no usable response was returned on this pass
- I tried both a full packet prompt and a much shorter delta-only prompt
- both were invoked subscription-only with API-key env vars unset
- both runs stayed live for minutes with no text output from the non-interactive CLI wrapper
Operational note:
- this failure mode looks different from the earlier
429capacity errors - on this pass the wrappers launched the CLIs successfully, but neither call produced a usable review before timing became unreasonable
2026-03-14: Source-Row Semantics Jump
State reviewed:
- theorem-backed true coverage moved to
433/574 - exact
2-element source-row semantics covered669/1200public problems - combined public decision surface reached
1164/1200 - unresolved tail is now
15true and21false
Claude summary:
- verdict: this is a serious Stage 1 attempt, not just a public-fit artifact
- strongest parts: clean theorem-family separation, source-row semantics as the right middle abstraction, disciplined false-side escalation, and duplicate-aware residual search
- main risks: the remaining
15true cases may expose a real blind spot in the fingerprint scheme, and the remaining21false cases may need something other than larger SAT instances - highest-leverage next moves:
- structurally characterize the remaining
15true cases - attack the
10unique residual false pairs with rewrite/divergence arguments rather than bigger models - run a collision analysis on the
2-element fingerprint classes to see where the current cheatsheet abstraction genuinely loses information
- structurally characterize the remaining
- code critique: the module split looks right, but
source_row_semantics.pynow deserves its own regression tests because it is central to the whole approach - cheatsheet critique: it is closer to a real decision procedure, but it still needs a more explicit definition of the fingerprint evaluation protocol and a cleaner separation between the evaluator and the lookup buckets
Gemini summary:
- no usable response was returned from the Gemini CLI on this updated review pass
- I retried three times, all subscription-only and with API-key env vars unset
- one run stalled on the file-reading prompt, and a later summary-only retry also failed to return within the allotted wait window
- earlier in this project Gemini also hit repeated server-side
429capacity failures ongemini-3.1-pro-preview, so the current issue still looks like CLI/service instability rather than a local API-key path
2026-03-14: 5.4 Pro Review
The strongest outside strategic review so far came from the 5.4 Pro handoff packet.
Main takeaways:
- the core direction is viable, but the winning abstraction is source-row semantics, not a collection of pair-local heuristics
- the broader mixed-self-reference family is exact on the full graph and was the right theorem jump to encode
- exact source-row semantics over tiny
2-element theories explains most of the remaining true mass after the theorem families - the false tail should be attacked as unique source-target pairs, not as repeated benchmark instances
- the cheatsheet should converge toward a tiny symbolic evaluator rather than a prose memo
2026-03-14: Internal-Singleton Jump
State reviewed:
- theorem-backed true coverage moved to
386/574 - false coverage was still
589/626 - new globally perfect source-law family covered
141/574public true cases
Claude summary:
- strongest result: the new source-law family was the real win
- biggest caveat:
188true and37false cases still unresolved - suggested next move: either formalize that family or push the false tail with larger finite-model search
Gemini summary:
- strongest result: the new source-law family was the cleanest addition
- biggest caveat: current small-model search and true-side heuristics hit a ceiling
- suggested next move: formally prove the new family, then change the search strategy for the remaining false cases
2026-03-14: Size-5 SAT Pass
State reviewed:
- theorem-backed true coverage stayed
386/574 - false coverage moved to
605/626 - size-
5SAT at1000 msadded16more false witnesses - raising those residual size-
5cases to3000 msadded0more witnesses
Claude summary:
- strongest result: false side is now nearly closed
- biggest caveat: diminishing returns are obvious on the false tail and the true residual is much larger
- suggested next move: shift focus toward the true-side residual
Gemini summary:
- strongest result: near-complete false-side coverage with exact public-label agreement
- biggest caveat: size-
5timeouts plateaued; more budget at the same setup is not enough - suggested next move: change the structural search strategy for the remaining
21false cases instead of just increasing timeout
Operational Note
These review runs were done subscription-only, with the common API-key environment variables explicitly unset before invoking the CLIs.
Gemini repeatedly hit server-side 429 capacity errors on gemini-3.1-pro-preview before returning.