Benchmarking LLMs on hard topological reasoning

TopoBench

Benchmark Overview

Six puzzle families with unique topological and geometric constraints

Six TopoBench puzzle families arranged by the global spatial constraints they test.
TopoBench covers connectivity, loop closure, symmetry, reflection, and contiguity across six puzzle families and three difficulty tiers.

Benchmark Design

How the benchmark is built

TopoBench is designed to test whether language models can preserve global spatial constraints over multiple reasoning steps without falling back on external tools or code execution.

900 total puzzle instances
6 × 3 puzzle families by difficulty tiers
50 instances per family-tier split
100k max output tokens for extended reasoning
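The counts above are consistent with each other: six families times three tiers gives 18 splits, and 50 instances per split gives 900. A minimal sketch of that arithmetic follows. The family names come from the results table header; the tier labels `easy` and `medium` are assumptions made for illustration (only a hard tier is named on this page).

```python
from itertools import product

# Family names as they appear in the results table header.
FAMILIES = ["FlowFree", "Bridges", "Loopy", "Galaxies", "Undead", "Pattern"]
# Tier labels are assumed; the page only states that there are three tiers.
TIERS = ["easy", "medium", "hard"]
INSTANCES_PER_SPLIT = 50

# One split per (family, tier) pair.
splits = list(product(FAMILIES, TIERS))
total_instances = len(splits) * INSTANCES_PER_SPLIT

print(len(splits), total_instances)
```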

Difficulty scales along two axes at once: board size and the generator’s internal difficulty dial. That means harder tiers demand deeper reasoning, not just larger grids.

Each puzzle family has its own verifier. Model outputs are parsed into structured solutions and checked against the complete constraint set, with no partial credit: a solution that violates any single constraint scores zero. Evaluation is pass@1 throughout.
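The all-or-nothing scoring rule can be sketched as follows. This is an illustrative outline, not the benchmark's verifier code: `score_attempt`, `pass_at_1`, `parse_ints`, and the toy constraints are hypothetical names, and the "puzzle" is a stand-in chosen only to keep the example short.

```python
from typing import Callable, Optional, Sequence

Solution = list[int]  # placeholder solution type for the toy example


def score_attempt(
    raw_output: str,
    parse: Callable[[str], Optional[Solution]],
    constraints: Sequence[Callable[[Solution], bool]],
) -> bool:
    """Return True only if the output parses and satisfies every constraint."""
    solution = parse(raw_output)
    if solution is None:  # unparseable output counts as a failure
        return False
    return all(check(solution) for check in constraints)


def pass_at_1(results: Sequence[bool]) -> float:
    """Fraction of puzzles solved with a single attempt each."""
    return sum(results) / len(results) if results else 0.0


def parse_ints(text: str) -> Optional[Solution]:
    """Toy parser: comma-separated integers, or None if malformed."""
    try:
        return [int(tok) for tok in text.split(",")]
    except ValueError:
        return None


# Toy constraint set: three distinct integers that sum to 6.
toy_constraints = [
    lambda s: len(s) == 3,
    lambda s: sum(s) == 6,
    lambda s: len(set(s)) == len(s),
]

outputs = ["1,2,3", "2,2,2", "not a grid", "1,2,4"]
results = [score_attempt(o, parse_ints, toy_constraints) for o in outputs]
print(results, pass_at_1(results))
```

Only the first output passes: the second breaks the distinctness constraint, the third fails to parse, and the fourth has the wrong sum, so none of them earn partial credit.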

Main Results

Verifier-checked accuracy across models and puzzle families

Frontier models form a clear top tier, but performance collapses on hard puzzles and several families remain close to unsolved.

Combined Accuracy Table

Rows with all-zero accuracy are omitted, matching the compact presentation in the paper.

Table columns: Model, FlowFree, Bridges, Loopy, Galaxies, Undead, Pattern, Avg

Failure Diagnosis

Why the models fail

The paper combines observational trace analysis with causal interventions to separate frequent behaviors from truly damaging ones.

Observational analysis

The authors label 750 reasoning traces from DeepSeek V3.2 with an error taxonomy using an LLM-as-a-judge pipeline, then focus analysis on the incorrect traces.

Explicit surrender
Repeated reasoning
Representation drift
State-tracking failure

Causal finding

Frequency is a poor proxy for importance. Constraint forgetting appears rarely in traces, but produces one of the largest downstream accuracy drops when injected into partial gold solutions.

Premature commitment
Constraint forgetting

Format Intervention

Input format intervention

The paper tests whether part of the benchmark difficulty comes from how grid structure gets flattened into tokens before any reasoning begins.

ASCII default text grid with ragged token boundaries
IntFormat comma-separated integers with cell-aligned tokenization
JSON 2D structured lists that preserve explicit boundaries
ASCII + Image multimodal input that adds a rendered puzzle image

Under standard tokenization, ASCII rows do not line up cleanly with cells. Tokens often straddle cell boundaries, which makes it harder for a model to reconstruct the board consistently from a flat token stream.

IntFormat and IntFormat-JSON use more tokens overall, but they preserve regular cell alignment. That trades a larger prompt for a cleaner spatial representation.
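To make the difference between the text formats concrete, here is a small sketch that serializes one toy grid three ways. The cell encoding is an assumption made for illustration (0 for an empty cell, `.` as the ASCII placeholder), and the function names are hypothetical; the benchmark's actual encodings may differ per puzzle family.

```python
import json

# Toy 3x3 grid; 0 marks an empty cell (assumed encoding).
grid = [
    [2, 0, 3],
    [0, 0, 0],
    [1, 0, 2],
]


def to_ascii(rows):
    """Compact text grid: one character per cell, no separators."""
    return "\n".join("".join(str(v) if v else "." for v in row) for row in rows)


def to_int_format(rows):
    """Comma-separated integers: every cell is delimited explicitly."""
    return "\n".join(",".join(str(v) for v in row) for row in rows)


def to_json_2d(rows):
    """Nested lists: row and cell boundaries are both explicit."""
    return json.dumps(rows)


print(to_ascii(grid))
print(to_int_format(grid))
print(to_json_2d(grid))
```

The ASCII form is the shortest, but a tokenizer is free to merge adjacent characters such as `...` into one token. The delimited forms cost more characters while giving each cell its own separator-bounded slot.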

Tool-Augmented Reasoning

Tool-augmented reasoning on hard Bridges

To test whether the real problem is parsing the initial board or maintaining constraints during multi-step reasoning, the paper gives DeepSeek V3.2 an external state interface on hard Bridges only.

40% baseline accuracy with no tools
46% structured-only tools
42% structured tools plus render_board
50% full structured query suite

The tools do not solve the puzzle. They expose the board state through `make_move`, `state_summary`, `neighbors`, and `components`, and optionally through a rendered ASCII board.
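A minimal sketch of what such a state interface could look like for Bridges is shown below. Only the four tool names come from the text; the class name `BridgesState`, the method signatures, the return shapes, and the specific legality checks are assumptions made for illustration, not the paper's implementation.

```python
from collections import defaultdict


class BridgesState:
    """Hypothetical external state for a Bridges puzzle."""

    def __init__(self, islands):
        self.islands = dict(islands)  # (row, col) -> required bridge count
        self.bridges = {}             # frozenset({a, b}) -> 1 or 2

    @staticmethod
    def _direction(a, b):
        """Unit step from a to b if they share a row or column, else None."""
        if a[0] == b[0] and a[1] != b[1]:
            return (0, 1 if b[1] > a[1] else -1)
        if a[1] == b[1] and a[0] != b[0]:
            return (1 if b[0] > a[0] else -1, 0)
        return None

    @staticmethod
    def _cells_between(a, b):
        """Grid cells strictly between two aligned islands."""
        if a[0] == b[0]:
            lo, hi = sorted((a[1], b[1]))
            return {(a[0], c) for c in range(lo + 1, hi)}
        lo, hi = sorted((a[0], b[0]))
        return {(r, a[1]) for r in range(lo + 1, hi)}

    def degree(self, a):
        return sum(n for pair, n in self.bridges.items() if a in pair)

    def neighbors(self, a):
        """Nearest island in each of the four directions, if any."""
        found = []
        for step in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            cands = [b for b in self.islands
                     if b != a and self._direction(a, b) == step]
            if cands:
                found.append(min(
                    cands, key=lambda b: abs(b[0] - a[0]) + abs(b[1] - a[1])))
        return found

    def make_move(self, a, b):
        """Add one bridge between a and b; return False if it is illegal."""
        if a not in self.islands or b not in self.islands:
            return False
        if b not in self.neighbors(a):
            return False
        pair = frozenset((a, b))
        if self.bridges.get(pair, 0) >= 2:  # at most two bridges per pair
            return False
        if (self.degree(a) >= self.islands[a]
                or self.degree(b) >= self.islands[b]):
            return False
        span = self._cells_between(a, b)
        for other in self.bridges:          # reject crossing bridges
            if other != pair and span & self._cells_between(*other):
                return False
        self.bridges[pair] = self.bridges.get(pair, 0) + 1
        return True

    def components(self):
        """Groups of islands connected by the bridges placed so far."""
        adj = defaultdict(set)
        for pair in self.bridges:
            a, b = tuple(pair)
            adj[a].add(b)
            adj[b].add(a)
        seen, groups = set(), []
        for start in self.islands:
            if start in seen:
                continue
            stack, group = [start], set()
            while stack:
                node = stack.pop()
                if node not in group:
                    group.add(node)
                    stack.extend(adj[node] - group)
            seen |= group
            groups.append(group)
        return groups

    def state_summary(self):
        return {
            "remaining": {a: need - self.degree(a)
                          for a, need in self.islands.items()},
            "num_components": len(self.components()),
        }


state = BridgesState({(0, 0): 1, (0, 2): 2, (2, 2): 1})
first = state.make_move((0, 0), (0, 2))   # legal
repeat = state.make_move((0, 0), (0, 2))  # rejected: island (0, 0) is full
second = state.make_move((0, 2), (2, 2))  # legal
print(first, repeat, second, state.state_summary())
```

In this sketch the interface never proposes a move. It only rejects illegal ones and reports the current state, which matches the description of tools that expose the board without solving it.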

Structured-only tools remove board-validity errors entirely and raise accuracy from 40% to 46%, which suggests that reliable access to the right intermediate state is already enough to recover meaningful performance.