Benchmarking LLMs on hard topological reasoning
TopoBench
Benchmark Overview
Six puzzle families with unique topological and geometric constraints
Benchmark Design
How the benchmark is built
TopoBench is designed to test whether language models can preserve global spatial constraints over multiple reasoning steps without falling back on external tools or code execution.
Difficulty scales along two axes at once: board size and the generator’s internal difficulty dial. That means harder tiers demand deeper reasoning, not just larger grids.
Every puzzle family has its own verifier, and model outputs are parsed into structured solutions and checked against the complete constraint set with no partial credit. Evaluation is pass@1 throughout.
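The all-or-nothing scoring described above can be sketched as follows. This is a minimal illustration rather than the benchmark's own code: `parse_solution`, `toy_verifier`, and `pass_at_1` are hypothetical names, and the toy constraint (distinct values in every row and column) only stands in for a real family-specific verifier.

```python
from typing import Callable, List, Optional

Grid = List[List[int]]

def parse_solution(raw: str) -> Optional[Grid]:
    """Parse whitespace-separated integers into a rectangular grid; None if malformed."""
    try:
        grid = [[int(tok) for tok in line.split()] for line in raw.strip().splitlines()]
    except ValueError:
        return None
    if not grid or any(len(row) != len(grid[0]) for row in grid):
        return None
    return grid

def toy_verifier(grid: Grid) -> bool:
    """Stand-in constraint set (assumption): values are distinct in every row and column."""
    rows_ok = all(len(set(row)) == len(row) for row in grid)
    cols_ok = all(len(set(col)) == len(col) for col in zip(*grid))
    return rows_ok and cols_ok

def pass_at_1(responses: List[str], verifier: Callable[[Grid], bool]) -> float:
    """One response per puzzle; a puzzle counts only if it parses and passes every check."""
    solved = 0
    for raw in responses:
        grid = parse_solution(raw)
        if grid is not None and verifier(grid):
            solved += 1
    return solved / len(responses)
```

A response that fails to parse scores the same as a wrong answer, which is what "no partial credit" implies in this sketch.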
Main Results
Verifier-checked accuracy across models and puzzle families
Frontier models form a clear top tier, but performance collapses on hard puzzles and several families remain close to unsolved.
| Model | FlowFree | Bridges | Loopy | Galaxies | Undead | Pattern | Avg |
|---|---|---|---|---|---|---|---|
Failure Diagnosis
Why the models fail
The paper combines observational trace analysis with causal interventions to separate frequent behaviors from truly damaging ones.
Observational analysis
The authors label 750 reasoning traces from DeepSeek V3.2 with an error taxonomy using an LLM-as-a-judge pipeline, then focus analysis on the incorrect traces.
Causal finding
Frequency is a poor proxy for importance. Constraint forgetting appears rarely in traces, but produces one of the largest downstream accuracy drops when injected into partial gold solutions.
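One way to picture the injection experiment is the sketch below. It makes several assumptions that this page does not specify: gold solutions are treated as ordered lists of steps, a single step in the kept prefix is corrupted, and the effect is reported as clean accuracy minus injected accuracy. `inject_error`, `accuracy_drop`, and the `corrupt` callback are hypothetical names.

```python
import random

def inject_error(gold_steps, prefix_len, corrupt, rng):
    """Return the first prefix_len gold steps with one step corrupted, plus its index."""
    if not 1 <= prefix_len <= len(gold_steps):
        raise ValueError("prefix_len must be between 1 and len(gold_steps)")
    prefix = list(gold_steps[:prefix_len])
    idx = rng.randrange(prefix_len)      # which step of the prefix to corrupt
    prefix[idx] = corrupt(prefix[idx])   # corrupt() encodes one error type (assumption)
    return prefix, idx

def accuracy_drop(clean_outcomes, injected_outcomes):
    """Accuracy when continuing from clean prefixes minus accuracy from corrupted ones."""
    clean = sum(clean_outcomes) / len(clean_outcomes)
    injected = sum(injected_outcomes) / len(injected_outcomes)
    return clean - injected
```

Under this framing, an error type can be rare in observed traces and still yield a large `accuracy_drop`, which is the distinction the causal finding draws.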
Format Intervention
Input format intervention
The paper tests whether part of the benchmark difficulty comes from how grid structure gets flattened into tokens before any reasoning begins.
Under standard tokenization, ASCII rows do not line up cleanly with cells. Tokens often straddle cell boundaries, which makes it harder for a model to reconstruct the board consistently from a flat token stream.
IntFormat and IntFormat-JSON use more tokens overall, but they preserve regular cell alignment. That trades a larger prompt for a cleaner spatial representation.
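The exact IntFormat and IntFormat-JSON encodings are not reproduced on this page, so the sketch below only illustrates the general idea: replace each character cell with an integer id and separate cells explicitly, so that cell boundaries no longer depend on how a tokenizer happens to split a row. The symbol mapping, the space delimiter, and the JSON field names are all assumptions.

```python
import json

def to_int_rows(ascii_board, symbol_ids):
    """Map each character cell to an integer id, one list per board row."""
    return [[symbol_ids[ch] for ch in line] for line in ascii_board.splitlines()]

def render_int_format(rows):
    """Space-delimited integers: one cell per field, one board row per line."""
    return "\n".join(" ".join(str(v) for v in row) for row in rows)

def render_int_json(rows):
    """JSON variant with explicit dimensions (field names are hypothetical)."""
    return json.dumps({"rows": len(rows), "cols": len(rows[0]), "grid": rows})

# Hypothetical 2x3 board: '.' is empty, letters are endpoints.
board = ".A.\nB.."
ids = {".": 0, "A": 1, "B": 2}
rows = to_int_rows(board, ids)
print(render_int_format(rows))
print(render_int_json(rows))
```

Both renderings are longer than the raw ASCII board, which mirrors the trade-off described above: more prompt tokens in exchange for regular cell alignment.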
Tool-Augmented Reasoning
Tool-augmented reasoning on hard Bridges
To test whether the real problem is parsing the initial board or maintaining constraints during multi-step reasoning, the paper gives DeepSeek V3.2 an external state interface on hard Bridges only.
The tools do not solve the puzzle. They expose the board state through `make_move`, `state_summary`, `neighbors`, and `components`, and optionally through a rendered ASCII board.
Structured-only tools remove board-validity errors entirely and raise accuracy, which shows that better access to the right intermediate state is already enough to recover meaningful performance.
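A state interface of this kind might look like the sketch below. Only the four tool names (`make_move`, `state_summary`, `neighbors`, `components`) come from the description above. The class name, signatures, return shapes, and the island representation are assumptions, and the rules encoded are the standard Bridges ones (at most two bridges per pair, no crossings, island counts respected, one connected group).

```python
def _crosses(p, q, u, v):
    """True if axis-aligned segments p-q and u-v cross at an interior point."""
    if p[0] == q[0] and u[1] == v[1]:      # p-q horizontal, u-v vertical
        h, w = (p, q), (u, v)
    elif p[1] == q[1] and u[0] == v[0]:    # p-q vertical, u-v horizontal
        h, w = (u, v), (p, q)
    else:
        return False
    row, col = h[0][0], w[0][1]
    c_lo, c_hi = sorted((h[0][1], h[1][1]))
    r_lo, r_hi = sorted((w[0][0], w[1][0]))
    return c_lo < col < c_hi and r_lo < row < r_hi

class BridgesState:
    """Toy external state; islands maps (row, col) -> required bridge count."""

    def __init__(self, islands):
        self.islands = dict(islands)
        self.bridges = {}                  # frozenset({a, b}) -> 1 or 2

    def degree(self, pos):
        return sum(n for pair, n in self.bridges.items() if pos in pair)

    def neighbors(self, pos):
        """Nearest island in each of the four directions, if any."""
        r, c = pos
        row = [p for p in self.islands if p[0] == r and p != pos]
        col = [p for p in self.islands if p[1] == c and p != pos]
        picks = [
            max((p for p in row if p[1] < c), default=None),  # left
            min((p for p in row if p[1] > c), default=None),  # right
            max((p for p in col if p[0] < r), default=None),  # up
            min((p for p in col if p[0] > r), default=None),  # down
        ]
        return [p for p in picks if p is not None]

    def make_move(self, a, b):
        """Add one bridge a-b if legal; return False and change nothing otherwise."""
        if a not in self.islands or b not in self.neighbors(a):
            return False
        key = frozenset((a, b))
        if self.bridges.get(key, 0) >= 2:
            return False
        if any(self.degree(p) >= self.islands[p] for p in (a, b)):
            return False
        if any(_crosses(a, b, *tuple(pair)) for pair in self.bridges):
            return False
        self.bridges[key] = self.bridges.get(key, 0) + 1
        return True

    def components(self):
        """Connected groups of islands under the current bridges."""
        adj = {p: set() for p in self.islands}
        for pair in self.bridges:
            u, v = tuple(pair)
            adj[u].add(v)
            adj[v].add(u)
        seen, groups = set(), []
        for start in sorted(self.islands):
            if start in seen:
                continue
            seen.add(start)
            stack, group = [start], []
            while stack:
                node = stack.pop()
                group.append(node)
                for nxt in adj[node] - seen:
                    seen.add(nxt)
                    stack.append(nxt)
            groups.append(sorted(group))
        return groups

    def state_summary(self):
        remaining = {p: need - self.degree(p) for p, need in self.islands.items()}
        done = all(v == 0 for v in remaining.values()) and len(self.components()) == 1
        bridges = {tuple(sorted(pair)): n for pair, n in self.bridges.items()}
        return {"remaining": remaining, "bridges": bridges, "solved": done}
```

Because `make_move` rejects illegal bridges instead of recording them, a model driving this interface cannot reach an invalid board; it still has to choose which legal moves to make, so the tools do not solve the puzzle for it.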