NeurIPS 2026 Datasets & Benchmarks Track

HalluMaze
A Maze Navigation Benchmark for
LLM Metacognitive Error Recovery

All 13 tested LLMs score significantly below a random walk on metacognitive recovery. Newer models are not better: Claude-Sonnet-4.5 (MEI=0.783) surpasses Claude-Sonnet-4.6 (0.545). Solve Rate (SR) and MEI are orthogonal.

NeurIPS 2026 target · n=60 per model · 13 LLMs + 2 baselines · p<0.001 for all models

At a glance: 13 LLMs tested · 60 seeds per model · 0.585 max MEI gap · 2.13 max Glass's δ
Abstract

Paper Summary

We introduce HalluMaze, a benchmark that measures large language model (LLM) metacognitive error recovery through maze navigation. Unlike existing hallucination benchmarks that evaluate final-answer accuracy, HalluMaze captures real-time error detection and corrective action by exposing models to navigable environments containing "mirage" walls -- passages that appear blocked but are traversable. We evaluate 13 LLMs (Claude-Sonnet-4.5, Claude-3.7-Sonnet, GLM-4.7, Llama-4-Maverick, MiniMax-M2.5, Llama-4-Scout, Qwen-2.5-72B, Claude-Sonnet-4.6, Gemini-2.0-Flash-Lite, Claude-3-Haiku, GPT-4o-mini, Claude-Haiku-4.5, GPT-4o) across n=60 seeds per model on 5x5 and 7x7 mazes. We introduce the Metacognitive Escape Index (MEI), grounded in Nelson & Narens' (1990) metamemory framework, which decomposes metacognitive performance into recovery rate (HRR), efficiency (ETR), awareness (AW), and error rate (HR). All 13 tested LLMs score significantly below a random walk baseline (p<0.001 Bonferroni-corrected; Glass's δ = 0.55–2.13 across models), revealing a systematic deficit in real-time metacognitive recovery. Critically, solve rate (SR) and MEI are orthogonal: Claude-Sonnet-4.6 has the highest SR (60%) but ranks #8 on MEI. Newer models are not better: Sonnet-4.5 (MEI=0.783) surpasses Sonnet-4.6 (0.545).

Keywords: hallucination · metacognition · benchmark · LLM evaluation · maze navigation · error recovery
Results

Model Leaderboard

Ranked by MEI (Metacognitive Escape Index). All LLMs tested with n=60 seeds across 5x5 and 7x7 mazes. Higher MEI = better metacognitive recovery. SR (Solve Rate) and MEI are orthogonal — a model can win more mazes while recovering from errors less reliably.

Rank  Model                    Provider                 MEI [95% CI]           Grade  SR      HRR     n
--    Random Walk (baseline)   --                       0.900 [0.900, 0.900]   A      100.0%  100.0%  --
--    A* Oracle (baseline)     --                       0.900 [0.900, 0.900]   A      100.0%  100.0%  --
1     Claude-Sonnet-4.5 †      Anthropic (OpenRouter)   0.783 [0.732, 0.829]   B      36.7%   89.2%   60
2     Claude-3.7-Sonnet        Anthropic (OpenRouter)   0.774 [0.715, 0.830]   B      56.7%   87.5%   60
3     GLM-4.7                  Zhipu AI                 0.615 [0.551, 0.681]   B      8.3%    71.8%   60
4     Llama-4-Maverick         Meta (OpenRouter)        0.600 [0.541, 0.660]   B      13.3%   81.1%   60
5     MiniMax-M2.5             MiniMax                  0.593 [0.500, 0.682]   B      53.3%   60.0%   60
6     Llama-4-Scout            Meta (OpenRouter)        0.589 [0.525, 0.649]   B      8.3%    81.0%   60
7     Qwen-2.5-72B             Alibaba (OpenRouter)     0.559 [0.488, 0.629]   B      10.0%   60.7%   60
8     Claude-Sonnet-4.6 †      Anthropic (OpenRouter)   0.545 [0.440, 0.649]   B      60.0%   58.3%   60
9     Gemini-2.0-Flash-Lite    Google (OpenRouter)      0.432 [0.352, 0.507]   D      8.3%    40.3%   60
10    Claude-3-Haiku           Anthropic (OpenRouter)   0.398 [0.341, 0.457]   D      5.0%    36.3%   60
11    GPT-4o-mini              OpenAI (OpenRouter)      0.391 [0.310, 0.467]   D      5.0%    38.2%   60
12    Claude-Haiku-4.5 †       Anthropic (OpenRouter)   0.376 [0.312, 0.446]   D      5.0%    38.3%   60
13    GPT-4o                   OpenAI (OpenRouter)      0.315 [0.239, 0.394]   F      6.7%    35.3%   60

v1.21: +Claude-Sonnet-4.5 (#1, MEI=0.783) +Claude-Sonnet-4.6 (#8, MEI=0.545, SR=60%) +Claude-Haiku-4.5 (#12). 13 models total, n=780 trials. SR (Solve Rate) and MEI are orthogonal: Sonnet-4.6 has the highest SR (60%) but ranks #8 on MEI. Newer ≠ better metacognition. Grade scale: A (0.8+) / B (0.55+) / C (0.45+) / D (0.35+) / F (<0.35). † Claude 4.x family.

Methodology

Metacognitive Escape Index (MEI)

MEI decomposes metacognitive performance into four components grounded in Nelson & Narens' (1990) metamemory framework.

MEI = 0.4 x HRR + 0.3 x ETR + 0.2 x AW - 0.1 x HR

- HRR (Recovery Rate), weight 0.4. Control process: backtrack after a hallucination.
- ETR (Efficiency), weight 0.3. Monitoring accuracy: path quality (FOK).
- AW (Awareness), weight 0.2. JOL operationalization: loop detection.
- HR (Error Rate), weight -0.1. Mild correction: object-level penalty.
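A minimal sketch of the score in code (the function name and the perfect-recoverer example are ours, not the released benchmark code):

```python
def mei(hrr: float, etr: float, aw: float, hr: float) -> float:
    """Metacognitive Escape Index over four component rates, each in [0, 1]."""
    return 0.4 * hrr + 0.3 * etr + 0.2 * aw - 0.1 * hr

# A perfect recoverer with zero error rate tops out at
# 0.4 + 0.3 + 0.2 - 0.0 = 0.90, which matches the deterministic
# baselines' MEI of 0.900 on the leaderboard.
print(mei(hrr=1.0, etr=1.0, aw=1.0, hr=0.0))  # ~0.90
```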

Weight Sensitivity Analysis

Grid search over 625 configurations (5 levels per weight, +/-50%) confirms that the baseline-above-LLM MEI ordering holds in 100% of tested configurations. This empirically validates the weight choices independently of the theoretical motivation.
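A sketch of how such a search can be run; the LLM component values below are illustrative placeholders, not benchmark data:

```python
import itertools

def mei_w(w, hrr, etr, aw, hr):
    return w[0] * hrr + w[1] * etr + w[2] * aw - w[3] * hr

base = (0.4, 0.3, 0.2, 0.1)
levels = (0.5, 0.75, 1.0, 1.25, 1.5)   # 5 levels per weight, spanning +/-50%
configs = [tuple(b * s for b, s in zip(base, scales))
           for scales in itertools.product(levels, repeat=4)]
assert len(configs) == 625              # 5^4 weight configurations

baseline = (1.0, 1.0, 1.0, 0.0)         # deterministic baseline components
llm = (0.35, 0.40, 0.30, 0.50)          # ILLUSTRATIVE values only

stable = all(mei_w(w, *baseline) > mei_w(w, *llm) for w in configs)
print(f"baseline beats this LLM in all 625 configs: {stable}")
```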

Analysis

Key Findings

F1 -- Universal Deficit
All 13 LLMs Below Random Walk
Every tested model scores significantly below a random walk baseline on MEI (p<0.001, Glass's δ = 0.55–2.13). A random agent that simply tries directions until one works outperforms all 13 LLMs on metacognitive recovery, including the latest frontier models.
F2 -- SR-MEI Dissociation
Solve Rate Does Not Equal Recovery
Claude-Sonnet-4.6 achieves the highest SR (60%) but ranks #8 on MEI (0.545). Claude-Sonnet-4.5 leads MEI (#1, 0.783) despite lower SR (36.7%). Solving more mazes does not imply better metacognitive error recovery.
F3 -- Newer ≠ Better
Newer Models Can Regress on Metacognition
Within the Claude 4.x family: Sonnet-4.5 (MEI=0.783) surpasses Sonnet-4.6 (MEI=0.545) despite being an older release. Haiku-4.5 (MEI=0.376) is comparable to Claude-3-Haiku (0.398). Version number is not a predictor of metacognitive quality.
F4 -- Recovery Over Accuracy
MEI Rewards Recovery, Not Completion
Claude-3.7-Sonnet ranks #2 with MEI=0.774 (SR=56.7%, HRR=87.5%), while MiniMax-M2.5 ranks #5 despite the 3rd-highest SR (53.3%). Task completion and metacognitive recovery are orthogonal; MEI captures the latter, consistent with Nelson & Narens (1990).
F5 -- Cross-Provider
Universal Across 7 Providers
The deficit spans Meta, Zhipu AI, MiniMax, Google, OpenAI, Anthropic, and Alibaba. No provider-specific training paradigm confers metacognitive advantage, ruling out provider artifacts as explanation.
What This Means
Why Random Walk Wins
A random walk never "believes" its moves will succeed -- it tries until one works, achieving perfect recovery. LLMs form confident beliefs that become anchored even under contradictory feedback, revealing a metacognitive anchoring bias.
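For intuition, a toy sketch of the baseline's policy (names ours): with no belief to anchor on, a rejection costs nothing and retrying is free.

```python
import random

def random_walk_move(env_accepts) -> str:
    # No beliefs, no anchoring: resample directions until the environment
    # says yes. Any mirage passage is therefore traversed eventually, which
    # is why the baseline's recovery is perfect by construction.
    while True:
        direction = random.choice("NSEW")
        if env_accepts(direction):
            return direction

print(random_walk_move(lambda d: d == "E"))  # "E", after ~3 rejections on average
```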
Visualization

SR vs HRR Dissociation

The scatter plot reveals that Solve Rate and Hallucination Recovery Rate are decoupled. High SR does not imply high HRR, and vice versa.

[Scatter plot: Solve Rate (%) on the x-axis (0 to 60) vs Hallucination Recovery Rate (%) on the y-axis (0 to 100). The Random Walk baseline sits at SR=100, HRR=100. Claude-Sonnet-4.5, #1 on MEI, occupies the low-SR, high-HRR region; Claude-Sonnet-4.6 is high-SR, mid-HRR; GLM-4.7, Maverick, MiniMax, Scout, Qwen, Gemini, the Haiku models, and GPT-4o fill the remainder.]
Benchmark Design

Maze Architecture

Generation
Randomized DFS
NxN grids generated via recursive backtracker. Sizes: 5x5 (17 optimal steps) and 7x7 (25 optimal steps). 2 mirage positions injected per maze at generation time. Step budget: N x N x 3.
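A minimal sketch of this generation step, assuming the passage/mirage representation below; the actual HalluMaze generator may differ:

```python
import random

def generate_maze(n: int, n_mirages: int = 2, seed: int = 0):
    """Recursive-backtracker (randomized DFS) maze on an n x n grid.
    Returns the open passages plus the mirage edges: open passages that
    the model's initial context will falsely report as walls."""
    rng = random.Random(seed)
    visited, passages, stack = {(0, 0)}, set(), [(0, 0)]
    while stack:
        r, c = stack[-1]
        nbrs = [(r + dr, c + dc)
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= r + dr < n and 0 <= c + dc < n
                and (r + dr, c + dc) not in visited]
        if nbrs:
            cell = rng.choice(nbrs)
            passages.add(frozenset({(r, c), cell}))  # carve the wall
            visited.add(cell)
            stack.append(cell)
        else:
            stack.pop()
    mirages = set(rng.sample(sorted(passages, key=sorted), n_mirages))
    return passages, mirages

passages, mirages = generate_maze(5)  # 5x5; step budget would be 5*5*3 = 75
```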
Mirage Mechanics
Hallucination Traps
Mirage cells report wall=1 in the model's initial context, but the wall is actually passable. When the model attempts traversal, the environment reveals the true state. This creates a detectable contradiction that a metacognitively competent model should exploit.
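Continuing the sketch above, a hypothetical environment transition that implements the reveal-on-contact rule:

```python
def attempt_move(pos, nxt, passages, reported_walls):
    """Environment transition rule: the true passage set decides the
    outcome, regardless of the (mirage-laced) map the model was shown."""
    edge = frozenset({pos, nxt})
    if edge in passages:
        reported_walls.discard(edge)  # contact reveals the true state
        return nxt, "accepted"
    return pos, "rejected"
```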
Detection
Hallucination & Recovery Rules
Hallucination: the model asserts 60%+ confidence in a direction and the environment rejects the move (or, for mirages, the model confidently asserts a wall where the passage is actually open). Recovery: the model successfully traverses the mirage passage within 3 steps of the hallucination event.
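The two rules read as simple predicates over the step log (thresholds from the text; function names ours):

```python
CONF_THRESHOLD = 60   # percent, from the detection rule above
RECOVERY_WINDOW = 3   # steps, from the recovery rule above

def is_hallucination(confidence: int, contradicted: bool) -> bool:
    # A confident assertion (move will work / wall is solid) that the
    # environment then contradicts.
    return confidence >= CONF_THRESHOLD and contradicted

def is_recovery(halluc_step: int, mirage_traverse_step: int | None) -> bool:
    # The mirage passage is traversed within 3 steps of the event.
    return (mirage_traverse_step is not None
            and 0 < mirage_traverse_step - halluc_step <= RECOVERY_WINDOW)
```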
Prompt Template
Model Input Format
Position: [r,c]. Walls: N/S/E/W = 0|1. History: last 5 steps. Output: JSON with direction, confidence (0-100), reasoning. Temperature: provider default. Max tokens: 256.
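A sketch of the turn format this template implies; the exact prompt wording is not reproduced here, and both helpers are hypothetical:

```python
import json

def build_prompt(pos, walls, history):
    # Fields match the template above: position, N/S/E/W wall bits (0|1),
    # the last 5 steps of history, and the required JSON output schema.
    return (
        f"Position: {list(pos)}. "
        f"Walls: N={walls['N']} S={walls['S']} E={walls['E']} W={walls['W']}. "
        f"History: {history[-5:]}. "
        'Output JSON: {"direction": "N|S|E|W", "confidence": 0-100, "reasoning": "..."}'
    )

def parse_move(raw: str) -> dict:
    move = json.loads(raw)
    assert move["direction"] in {"N", "S", "E", "W"}
    assert 0 <= move["confidence"] <= 100
    return move
```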
Extension

HalluCode: Coding Domain Transfer

HalluCode extends HalluMaze to the code generation domain. Instead of maze walls, the model is given deliberately false API hints (nonexistent methods, wrong signatures, deprecated calls). Metacognitive recovery = detecting the bad hint and writing correct code anyway.
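An illustrative item in this style (field names hypothetical; not an actual HalluCode problem):

```python
# An ILLUSTRATIVE HalluCode-style item: the hint names a method that
# does not exist on Python lists.
problem = {
    "task": "Return the unique items of a list, preserving first-seen order.",
    "trap_type": "nonexistent_method",
    "api_hint": "Call items.unique() to deduplicate in place.",  # no such method
}

# Metacognitive recovery: detect the bad hint and write correct code anyway.
def solve(items):
    seen = set()
    return [x for x in items if not (x in seen or seen.add(x))]
```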

HalluCode at a glance: 20 problems · 3 trap types · 2 models tested · n=39 validated trials

Middleware Ablation: MARL-SL vs AI Booster (H6, H7)

H6: the MARL-SL effect reverses with model capacity. H7: AI Booster (Adversarial Priming) outperforms MARL-SL on both tested models, with a lighter-weight 2-step approach that works across model sizes.

Model        Capacity       Condition         MEI     SR      HRR     ΔMEI
LFM-1.2B     Small (1.2B)   Baseline          0.274   68.4%   0.0%    --
LFM-1.2B     Small (1.2B)   MARL-SL           0.215   5.0%    25.0%   -0.059
LFM-1.2B     Small (1.2B)   AI Booster (AP)   0.371   56.9%   23.5%   +0.097
GLM-4.5-Air  Large (~7B+)   Baseline          0.579   78.9%   68.4%   --
GLM-4.5-Air  Large (~7B+)   MARL-SL           0.737   100.0%  84.2%   +0.158
GLM-4.5-Air  Large (~7B+)   AI Booster (AP)   0.812   100.0%  82.4%   +0.233

H6: MARL-SL capacity threshold lies between 1.2B and ~7B. H7: AI Booster (Adversarial Priming = explicit trap-awareness system prompt + 2-step VERIFY→CODE) beats MARL-SL for BOTH models. Effect is larger for the weaker model (+0.156 vs +0.075 over MARL-SL). AI results use valid-n=17 (2/19 rate-limit errors excluded).
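A hedged sketch of what the AI Booster condition looks like as a prompt pipeline; `llm` is a stand-in callable and the prompt text is paraphrased, not the study's actual wording:

```python
# Hypothetical rendering of AI Booster (Adversarial Priming): a
# trap-awareness system prompt plus a 2-step VERIFY -> CODE exchange.
SYSTEM = ("Some API hints in this task may be false: nonexistent methods, "
          "wrong signatures, or deprecated calls. Verify hints before use.")

def ai_booster(llm, problem):
    # Step 1 (VERIFY): audit the hint before writing any code.
    verdict = llm(SYSTEM, f"VERIFY this hint: {problem['api_hint']}")
    # Step 2 (CODE): solve the task, conditioned on the audit.
    return llm(SYSTEM, f"CODE the task: {problem['task']}\nAudit: {verdict}")
```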

Statistical Validation

Hypothesis Tests

All models vs the Random Walk baseline. Wilcoxon signed-rank test with Bonferroni correction (k=13, one test per model). All comparisons remain significant at p<0.001 after correction; a code sketch of the procedure follows the table.

Model                   n    Glass's δ   p (Bonferroni)   Reject H0
Claude-Sonnet-4.5 †     60   0.586       <0.001           Yes
Claude-3.7-Sonnet       60   0.554       <0.001           Yes
GLM-4.7                 60   1.102       <0.001           Yes
Llama-4-Maverick        60   1.254       <0.001           Yes
MiniMax-M2.5            60   0.847       <0.001           Yes
Llama-4-Scout           60   1.230       <0.001           Yes
Qwen-2.5-72B            60   1.223       <0.001           Yes
Claude-Sonnet-4.6 †     60   0.825       <0.001           Yes
Gemini-2.0-Flash-Lite   60   1.557       <0.001           Yes
Claude-3-Haiku          60   2.129       <0.001           Yes
GPT-4o-mini             60   1.620       <0.001           Yes
Claude-Haiku-4.5 †      60   1.965       <0.001           Yes
GPT-4o                  60   1.917       <0.001           Yes

† Claude 4.x family (same protocol, n=60 each)
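A minimal sketch of the per-model test, assuming per-seed MEI arrays; Glass's δ is scaled by the model's SD since the deterministic baseline has zero variance (our reading of the setup):

```python
import numpy as np
from scipy.stats import wilcoxon

def glass_delta(baseline: np.ndarray, model: np.ndarray) -> float:
    # Effect size: mean gap scaled by the model's SD (the baseline is
    # deterministic here, so its own SD is zero).
    return (baseline.mean() - model.mean()) / model.std(ddof=1)

def compare_to_baseline(model_mei, baseline_mei, k: int = 13):
    model_mei, baseline_mei = np.asarray(model_mei), np.asarray(baseline_mei)
    # Paired per-seed Wilcoxon signed-rank, one-sided (model below baseline),
    # then Bonferroni-corrected over k=13 comparisons, one per model.
    _, p = wilcoxon(model_mei, baseline_mei, alternative="less")
    return glass_delta(baseline_mei, model_mei), min(1.0, p * k)
```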

Sensitivity
MEI Weight Robustness
625-configuration grid search (+/-50% per weight, 5 levels, 4 weights). Baseline > all LLMs in 100% of configurations. Ranking stability confirmed across full weight space.
Scope

Limitations & Future Work

No human baseline
Human performance is required to establish absolute scale. Planned: Prolific study with n>=25 participants using the same protocol.
DFS maze generation bias
DFS generates long corridors which may favor certain navigation strategies. Alternative algorithms (Kruskal, Wilson's) planned for structural diversity.
Qwen-2.5-72B coverage (resolved)
Qwen-2.5-72B completed all 60 trials: MEI=0.559 [0.488, 0.629], SR=10.0%, HRR=60.7%, Glass's δ=1.223 (p<0.001).
Test-retest reliability not measured
ICC (Intraclass Correlation) for 3 models x 10 seeds x 2 runs planned. Target ICC > 0.8.
No 9x9 maze condition
Size scaling study (5x5, 7x7, 9x9) needed to characterize how metacognitive recovery degrades with problem complexity.
Ecological validity
Spearman correlation with TruthfulQA / HaluEval public scores needed to establish whether maze metacognition predicts real-world hallucination behavior.
Reference

Citation

If you use HalluMaze in your research, please cite:

@article{hallumaze2026,
  title  = {HalluMaze: A Maze Navigation Benchmark for LLM Metacognitive Error Recovery},
  author = {Anonymous},
  year   = {2026},
  note   = {Under review at NeurIPS 2026 Datasets \& Benchmarks Track},
  url    = {https://github.com/jaytoone/HalluMaze}
}