SPECTER Labs
Technical Docs

NeuroMonster Intro Variants

Prepared for the neuromonster.org 250-word research intro submission. Word counts below exclude headings.

Version A: Balanced

Wonton Soup studies proof-search behavior under intervention. We use an MCTS-based theorem prover in Lean, guided by tactic providers that include finetuned local language models and heuristic baselines, and ask a simple question: when we block part of a successful proof route, does search recover the same proof family, reroute into a different one, or fail?

Our intervention protocol starts from a solved wild-type run, blocks one tactic or tactic family from the solution path, and reruns under the same theorem slice, search budget, and configuration. Across repeated seeded runs, we log iteration-level search histories, search graphs, and proof-term or trace-derived graph artifacts. We then compare wild-type and intervention outcomes with graph edit distance, trajectory divergence and reconvergence metrics, and multi-seed basin analysis.
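The lesion-and-rerun protocol above can be sketched in a few lines. This is an illustrative sketch only, not the framework's actual API: `run_search`, `RunResult`, and the field names are all hypothetical stand-ins for whatever the real prover harness exposes.

```python
# Hypothetical sketch of the lesion-and-rerun protocol. `run_search`,
# `RunResult`, and all field names are illustrative, not the real API.
from dataclasses import dataclass

@dataclass
class RunResult:
    solved: bool
    solution_tactics: list  # tactics on the successful proof path, in order
    search_graph: set       # edges of the explored search graph

def lesion_and_rerun(run_search, theorem, seed, budget):
    """Solve the wild-type run first, then block each solution-path tactic
    in turn and rerun under the identical theorem, seed, and budget."""
    wild_type = run_search(theorem, seed=seed, budget=budget,
                           blocked=frozenset())
    if not wild_type.solved:
        return wild_type, []  # nothing to lesion if the baseline fails
    interventions = []
    for tactic in wild_type.solution_tactics:
        constrained = run_search(theorem, seed=seed, budget=budget,
                                 blocked=frozenset({tactic}))
        interventions.append((tactic, constrained))
    return wild_type, interventions
```

The key design point the sketch captures is that only the blocked set changes between runs; theorem, seed, and budget are held fixed so any divergence is attributable to the lesion.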

Across interventions, three response modes recur: preservation of the same proof family, rerouting into a distinct proof family, and collapse. The existence of stable reroutes and measurable recovery dynamics supports a limited proto-cognitive claim: some proof-search systems can preserve goal-directed competence under perturbation through more than one recurrent structural route.

A paired Lean-Rocq benchmark on matched theorem sets provides early cross-assistant evidence for this picture, showing a real but still bounded structural signal. Coverage remains incomplete, so stronger generalization claims will require broader corpora, more providers, and deeper cross-assistant evaluation.

By putting robustness and multistability inside a reproducible formal setting, Wonton Soup turns theorem proving into a concrete testbed for hypotheses about goal-directed behavior in artificial systems.

Word count: 240

Version B: Conservative Formal-Methods Framing

Wonton Soup is a reproducible intervention framework for studying robustness in MCTS-based theorem proving. We run proof search in Lean with tactic providers that include finetuned local language models and heuristic baselines, then compare wild-type and constrained runs on matched theorem slices. The core intervention is simple: after a wild-type proof is found, we block one tactic or tactic family from its solution path and rerun under the same budget and configuration. Across repeated seeded runs, we record iteration-level search histories, search graphs, and proof-term or trace-derived graph artifacts.

We evaluate whether the constrained run stays in the same proof family, reroutes into a distinct one, or fails. The analysis combines graph edit distance, trajectory divergence and recovery metrics, and multi-seed basin analysis, which measures how often the same structural family reappears under fixed settings.
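Multi-seed basin analysis as described above reduces to a frequency count over structural family labels. The sketch below assumes a hypothetical `run_family` callable that maps a (theorem, seed) run to a proof-family label, or `None` on failure; neither name comes from the framework itself.

```python
# Hedged sketch of multi-seed basin analysis. `run_family` is a hypothetical
# callable mapping (theorem, seed) to a proof-family label, or None on failure.
from collections import Counter

def basin_analysis(run_family, theorem, seeds):
    """Rerun one fixed configuration across many seeds and count how often
    each structural proof family recurs; the normalized counts estimate
    basin sizes under those fixed settings."""
    counts = Counter(run_family(theorem, seed=s) for s in seeds)
    total = len(seeds)
    return {family: n / total for family, n in counts.items()}
```

A family that reappears across most seeds occupies a large basin; a family seen once may be an artifact of a single trajectory, which is why the metric is defined over repeated seeded runs rather than a single comparison.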

Across current experiments, all three modes occur. Some proofs are structurally stable under intervention; some reliably recover through alternate routes; others collapse when key steps are blocked. These results suggest that robustness in theorem proving is not captured by solve rate alone. Structural response under intervention provides a sharper view of how search policies use the proof space.

All comparisons keep theorem selection, budget, and artifact contracts explicit and reproducible.

A paired Lean-Rocq benchmark on matched theorem sets provides early cross-assistant support for this approach, with a real but limited structural signal. Coverage is still incomplete, so broader claims will require wider corpora and deeper evaluation.

Word count: 241

Version C: NeuroMonster-Facing

Wonton Soup asks whether proof search can show simple cognitive-like signatures under controlled damage. We use an MCTS-based theorem prover in Lean, guided by tactic providers that include finetuned local language models and heuristic baselines, and intervene directly on successful proof routes. After a wild-type run solves a theorem, we block one tactic or tactic family from that solution path and rerun under the same theorem slice, budget, and configuration. Across repeated seeded runs, we log iteration-level search histories, search graphs, and proof-term or trace-derived graph artifacts.

This lets us ask not just whether the system still solves the theorem, but how it responds. Does it preserve the same proof family, recover through a different recurrent route, or collapse? We measure those outcomes with graph edit distance, trajectory divergence and reconvergence, and multi-seed basin analysis.

Across interventions, all three response modes recur. The most interesting cases are stable reroutes: the system loses one available route yet still regains goal-directed performance through another. That does not justify a strong claim about cognition, but it does support a limited proto-cognitive interpretation. Under perturbation, proof search can sometimes preserve competence through more than one recurrent structural route.

A paired Lean-Rocq benchmark on matched theorem sets adds early cross-assistant evidence, showing a real but bounded structural signal to date. By placing robustness, rerouting, and multistability inside a reproducible formal setting, Wonton Soup offers a concrete testbed for hypotheses about goal-directed behavior in artificial systems.

Word count: 240

Version D: Findings-Forward

Wonton Soup is a reproducible intervention framework for studying robustness in MCTS-based theorem proving. We run proof search in Lean with tactic providers that include finetuned local language models and heuristic baselines, then compare an unconstrained baseline run to intervention runs, where we deliberately block one tactic or tactic family from a successful proof route and rerun the same theorem under the same budget and configuration. Across repeated seeded runs, we record iteration-level search histories, search graphs, and proof-term or trace-derived graph artifacts.

On the current matched slice, these lesions produce three recurring outcomes: replicate, where the same proof family reappears; reroute, where search reaches the theorem through a distinct proof family; and collapse, where the constrained run fails. Collapse occurs most often, but stable rerouting is a substantial minority. This suggests that theorem proving under a fixed policy is neither uniformly brittle nor locked to one derivational route. In that limited sense, a theorem is not exhausted by a single successful derivation: some theorem-policy pairs admit alternate recurrent proof families that search can re-enter after perturbation.
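The replicate/reroute/collapse taxonomy is a straightforward decision rule once each run carries a solved flag and a structural-family label. The sketch below is hypothetical: the `Run` record and its fields stand in for whatever per-run artifact the framework actually emits.

```python
# Hypothetical sketch of the outcome taxonomy. `Run` and its fields stand
# in for the framework's real per-run artifact.
from collections import namedtuple

Run = namedtuple("Run", ["solved", "family"])

def classify_outcome(wild_type_family, constrained):
    """Map a constrained run to one of the three outcomes: 'collapse' if
    it fails, 'replicate' if it lands in the wild-type proof family,
    'reroute' if it succeeds through a distinct family."""
    if not constrained.solved:
        return "collapse"
    if constrained.family == wild_type_family:
        return "replicate"
    return "reroute"
```

Aggregating these labels across lesions and seeds is what yields the plurality-collapse, substantial-minority-reroute picture reported above, rather than a single solve-rate number.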

Graph edit distance, trajectory divergence and recovery metrics, and multi-seed basin analysis let us measure those regimes directly rather than collapsing everything into solve rate. A paired Lean-Rocq benchmark on matched theorem sets provides early cross-assistant support for this picture, with a real but limited structural signal. Coverage is still incomplete, so broader claims will require wider corpora and deeper evaluation.

Word count: 238