External Review Notes

Short notes from the local Claude CLI and Gemini CLI review passes.

2026-03-15: Micro-Rewrite + Canonical False-Tail Pass

State reviewed:

  • exact public decision surface stayed 1164/1200
  • direct kernel bridge candidate stayed 1177/1200
  • the new 2-kernel micro-rewrite layer pushed the teacher-side candidate surface to 1179/1200
  • the remaining candidate tail is now 21 false and 0 true
  • canonicalizing the 7 size-5 unknown false pairs changed 1 to unsat, leaving 6 canonical unknowns

Claude summary:

  • no usable response was returned on this pass
  • I tried both the wrapper-based full prompt and a very short direct claude --print prompt
  • both were invoked subscription-only with API-key env vars unset
  • the local CLI stayed live with no text output from the non-interactive process

Gemini summary:

  • strongest part: the evaluator now gets high coverage from a compact stack instead of from one brittle trick
  • biggest risk: the remaining 6 canonical size-5 unknowns may force a much more expensive false-side search if the current solver surface does not improve
  • suggested next move: target those 6 canonical unknowns with better constrained SAT/SMT or targeted enumeration
  • judgment on the 2-kernel micro-rewrite layer: it looks like a compact offline abstraction, not just a teacher-only patch, because it yields short modular proof traces instead of opaque labels
  • this response came only after Gemini first hit a 429 capacity retry on gemini-3.1-pro-preview

2026-03-15: Pair Evaluator + Honest Kernel Bridge Pass

State reviewed:

  • theorem-backed true coverage stayed 433/574
  • fixed 10-table pair evaluator reached 467/626 public false
  • combined exact public surface stayed 1164/1200
  • nontrivial kernel bridge candidate reached 1177/1200
  • residuals under that candidate surface are now 2 true and 21 false

Claude summary:

  • no usable response was returned on this pass
  • I tried both a full packet prompt and a much shorter delta-only prompt
  • both were invoked subscription-only with API-key env vars unset
  • both runs stayed live for minutes with no text output from the non-interactive CLI wrapper

Gemini summary:

  • no usable response was returned on this pass
  • I tried both a full packet prompt and a much shorter delta-only prompt
  • both were invoked subscription-only with API-key env vars unset
  • both runs stayed live for minutes with no text output from the non-interactive CLI wrapper

Operational note:

  • this failure mode looks different from the earlier 429 capacity errors
  • on this pass the wrappers launched the CLIs successfully, but neither call produced a usable review before timing became unreasonable

2026-03-14: Source-Row Semantics Jump

State reviewed:

  • theorem-backed true coverage moved to 433/574
  • exact 2-element source-row semantics covered 669/1200 public problems
  • combined public decision surface reached 1164/1200
  • unresolved tail is now 15 true and 21 false

Claude summary:

  • verdict: this is a serious Stage 1 attempt, not just a public-fit artifact
  • strongest parts: clean theorem-family separation, source-row semantics as the right middle abstraction, disciplined false-side escalation, and duplicate-aware residual search
  • main risks: the remaining 15 true cases may expose a real blind spot in the fingerprint scheme, and the remaining 21 false cases may need something other than larger SAT instances
  • highest-leverage next moves:
    • structurally characterize the remaining 15 true cases
    • attack the 10 unique residual false pairs with rewrite/divergence arguments rather than bigger models
    • run a collision analysis on the 2-element fingerprint classes to see where the current cheatsheet abstraction genuinely loses information
  • code critique: the module split looks right, but source_row_semantics.py now deserves its own regression tests because it is central to the whole approach
  • cheatsheet critique: it is closer to a real decision procedure, but it still needs a more explicit definition of the fingerprint evaluation protocol and a cleaner separation between the evaluator and the lookup buckets

Gemini summary:

  • no usable response was returned from the Gemini CLI on this updated review pass
  • I retried three times, all subscription-only and with API-key env vars unset
  • one run stalled on the file-reading prompt, and a later summary-only retry also failed to return within the allotted wait window
  • earlier in this project Gemini also hit repeated server-side 429 capacity failures on gemini-3.1-pro-preview, so the current issue still looks like CLI/service instability rather than a local API-key path

2026-03-14: 5.4 Pro Review

The strongest outside strategic review so far came from the 5.4 Pro handoff packet.

Main takeaways:

  • the core direction is viable, but the winning abstraction is source-row semantics, not a collection of pair-local heuristics
  • the broader mixed-self-reference family is exact on the full graph and was the right theorem jump to encode
  • exact source-row semantics over tiny 2-element theories explains most of the remaining true mass after the theorem families
  • the false tail should be attacked as unique source-target pairs, not as repeated benchmark instances
  • the cheatsheet should converge toward a tiny symbolic evaluator rather than a prose memo

2026-03-14: Internal-Singleton Jump

State reviewed:

  • theorem-backed true coverage moved to 386/574
  • false coverage was still 589/626
  • new globally perfect source-law family covered 141/574 public true cases

Claude summary:

  • strongest result: the new source-law family was the real win
  • biggest caveat: 188 true and 37 false cases still unresolved
  • suggested next move: either formalize that family or push the false tail with larger finite-model search

Gemini summary:

  • strongest result: the new source-law family was the cleanest addition
  • biggest caveat: current small-model search and true-side heuristics hit a ceiling
  • suggested next move: formally prove the new family, then change the search strategy for the remaining false cases

2026-03-14: Size-5 SAT Pass

State reviewed:

  • theorem-backed true coverage stayed 386/574
  • false coverage moved to 605/626
  • size-5 SAT at 1000 ms added 16 more false witnesses
  • raising those residual size-5 cases to 3000 ms added 0 more witnesses

Claude summary:

  • strongest result: false side is now nearly closed
  • biggest caveat: diminishing returns are obvious on the false tail and the true residual is much larger
  • suggested next move: shift focus toward the true-side residual

Gemini summary:

  • strongest result: near-complete false-side coverage with exact public-label agreement
  • biggest caveat: size-5 timeouts plateaued; more budget at the same setup is not enough
  • suggested next move: change the structural search strategy for the remaining 21 false cases instead of just increasing timeout

Operational Note

These review runs were done subscription-only, with the common API-key environment variables explicitly unset before invoking the CLIs.

Gemini repeatedly hit server-side 429 capacity errors on gemini-3.1-pro-preview before returning.