NeurIPS 2026 Datasets & Benchmarks Track

HalluMaze
A Maze Navigation Benchmark for
LLM Metacognitive Error Recovery

All 13 tested LLMs score significantly below a random walk on metacognitive recovery. Newer models are not better: Claude-Sonnet-4.5 (MEI=0.783) surpasses Claude-Sonnet-4.6 (0.545). Solve Rate (SR) and MEI are orthogonal.

NeurIPS 2026 target · n=60 per model · 13 LLMs + 2 baselines · p<0.001 for all models

At a glance: 13 LLMs tested · 60 seeds per model · 0.585 max MEI gap · 2.13 max Glass's δ
Abstract

Paper Summary

We introduce HalluMaze, a benchmark that measures large language model (LLM) metacognitive error recovery through maze navigation. Unlike existing hallucination benchmarks that evaluate final-answer accuracy, HalluMaze captures real-time error detection and corrective action by exposing models to navigable environments containing "mirage" walls -- passages that appear blocked but are traversable. We evaluate 13 LLMs (Claude-Sonnet-4.5, Claude-3.7-Sonnet, GLM-4.7, Llama-4-Maverick, MiniMax-M2.5, Llama-4-Scout, Qwen-2.5-72B, Claude-Sonnet-4.6, Gemini-2.0-Flash-Lite, Claude-3-Haiku, GPT-4o-mini, Claude-Haiku-4.5, GPT-4o) across n=60 seeds per model on 5x5 and 7x7 mazes. We introduce the Metacognitive Escape Index (MEI), grounded in Nelson & Narens' (1990) metamemory framework, which decomposes metacognitive performance into recovery rate (HRR), efficiency (ETR), awareness (AW), and error rate (HR). All 13 tested LLMs score significantly below a random walk baseline (p<0.001 Bonferroni-corrected; Glass's δ = 0.55–2.13 across models), revealing a systematic deficit in real-time metacognitive recovery. Critically, solve rate (SR) and MEI are orthogonal: Claude-Sonnet-4.6 has the highest SR (60%) but ranks #8 on MEI. Newer models are not better: Sonnet-4.5 (MEI=0.783) surpasses Sonnet-4.6 (0.545).

Keywords: hallucination · metacognition · benchmark · LLM evaluation · maze navigation · error recovery
Results

Model Leaderboard

Ranked by MEI (Metacognitive Escape Index). All LLMs tested with n=60 seeds across 5x5 and 7x7 mazes. Higher MEI = better metacognitive recovery. SR (Solve Rate) and MEI are orthogonal — a model can win more mazes while recovering from errors less reliably.

Rank  Model                    Provider                 MEI [95% CI]           Grade  SR      HRR     n
--    Random Walk (baseline)   --                       0.900 [0.900, 0.900]   A      100.0%  100.0%  --
--    A* Oracle (baseline)     --                       0.900 [0.900, 0.900]   A      100.0%  100.0%  --
1     Claude-Sonnet-4.5 †      Anthropic (OpenRouter)   0.783 [0.732, 0.829]   B      36.7%   89.2%   60
2     Claude-3.7-Sonnet        Anthropic (OpenRouter)   0.774 [0.715, 0.830]   B      56.7%   87.5%   60
3     GLM-4.7                  Zhipu AI                 0.615 [0.551, 0.681]   B      8.3%    71.8%   60
4     Llama-4-Maverick         Meta (OpenRouter)        0.600 [0.541, 0.660]   B      13.3%   81.1%   60
5     MiniMax-M2.5             MiniMax                  0.593 [0.500, 0.682]   B      53.3%   60.0%   60
6     Llama-4-Scout            Meta (OpenRouter)        0.589 [0.525, 0.649]   B      8.3%    81.0%   60
7     Qwen-2.5-72B             Alibaba (OpenRouter)     0.559 [0.488, 0.629]   B      10.0%   60.7%   60
8     Claude-Sonnet-4.6 †      Anthropic (OpenRouter)   0.545 [0.440, 0.649]   B      60.0%   58.3%   60
9     Gemini-2.0-Flash-Lite    Google (OpenRouter)      0.432 [0.352, 0.507]   D      8.3%    40.3%   60
10    Claude-3-Haiku           Anthropic (OpenRouter)   0.398 [0.341, 0.457]   D      5.0%    36.3%   60
11    GPT-4o-mini              OpenAI (OpenRouter)      0.391 [0.310, 0.467]   D      5.0%    38.2%   60
12    Claude-Haiku-4.5 †       Anthropic (OpenRouter)   0.376 [0.312, 0.446]   D      5.0%    38.3%   60
13    GPT-4o                   OpenAI (OpenRouter)      0.315 [0.239, 0.394]   F      6.7%    35.3%   60

v1.21: +Claude-Sonnet-4.5 (#1, MEI=0.783) +Claude-Sonnet-4.6 (#8, MEI=0.545, SR=60%) +Claude-Haiku-4.5 (#12). 13 models total, n=780 trials. SR (Solve Rate) and MEI are orthogonal: Sonnet-4.6 has the highest SR (60%) but ranks #8 on MEI. Newer ≠ better metacognition. Grade scale: A (0.8+) / B (0.55+) / C (0.45+) / D (0.35+) / F (<0.35). † Claude 4.x family.

Methodology

Metacognitive Escape Index (MEI)

MEI decomposes metacognitive performance into four components grounded in Nelson & Narens' (1990) metamemory framework.

MEI = 0.4 x HRR + 0.3 x ETR + 0.2 x AW - 0.1 x HR

- HRR (Recovery Rate), weight 0.4. Control process: backtrack after a hallucination.
- ETR (Efficiency), weight 0.3. Monitoring accuracy: path quality (FOK).
- AW (Awareness), weight 0.2. JOL operationalization: loop detection.
- HR (Error Rate), weight -0.1. Mild correction: object-level penalty.
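A minimal sketch of the score in code (the function name and the perfect-recoverer example are ours, not the released benchmark code):

```python
def mei(hrr: float, etr: float, aw: float, hr: float) -> float:
    """Metacognitive Escape Index over four component rates, each in [0, 1]."""
    return 0.4 * hrr + 0.3 * etr + 0.2 * aw - 0.1 * hr

# A perfect recoverer with zero error rate tops out at
# 0.4 + 0.3 + 0.2 - 0.0 = 0.90, which matches the deterministic
# baselines' MEI of 0.900 on the leaderboard.
print(mei(hrr=1.0, etr=1.0, aw=1.0, hr=0.0))  # ~0.90
```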

Weight Sensitivity Analysis

Grid search over 625 configurations (5 levels per weight, +/-50%) confirms that the baseline-above-LLM MEI ordering holds in 100% of tested configurations. This empirically validates the weight choices independently of the theoretical motivation.
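A sketch of how such a search can be run; the LLM component values below are illustrative placeholders, not benchmark data:

```python
import itertools

def mei_w(w, hrr, etr, aw, hr):
    return w[0] * hrr + w[1] * etr + w[2] * aw - w[3] * hr

base = (0.4, 0.3, 0.2, 0.1)
levels = (0.5, 0.75, 1.0, 1.25, 1.5)   # 5 levels per weight, spanning +/-50%
configs = [tuple(b * s for b, s in zip(base, scales))
           for scales in itertools.product(levels, repeat=4)]
assert len(configs) == 625              # 5^4 weight configurations

baseline = (1.0, 1.0, 1.0, 0.0)         # deterministic baseline components
llm = (0.35, 0.40, 0.30, 0.50)          # ILLUSTRATIVE values only

stable = all(mei_w(w, *baseline) > mei_w(w, *llm) for w in configs)
print(f"baseline beats this LLM in all 625 configs: {stable}")
```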

Analysis

Key Findings

F1 -- Universal Deficit
All 13 LLMs Below Random Walk
Every tested model scores significantly below a random walk baseline on MEI (p<0.001, Glass's δ = 0.55–2.13). A random agent that simply tries directions until one works outperforms all 13 LLMs on metacognitive recovery, including the latest frontier models.
F2 -- SR-MEI Dissociation
Solve Rate Does Not Equal Recovery
Claude-Sonnet-4.6 achieves the highest SR (60%) but ranks #8 on MEI (0.545). Claude-Sonnet-4.5 leads MEI (#1, 0.783) despite lower SR (36.7%). Solving more mazes does not imply better metacognitive error recovery.
F3 -- Newer ≠ Better
Newer Models Can Regress on Metacognition
Within the Claude 4.x family: Sonnet-4.5 (MEI=0.783) surpasses Sonnet-4.6 (MEI=0.545) despite being an older release. Haiku-4.5 (MEI=0.376) is comparable to Claude-3-Haiku (0.398). Version number is not a predictor of metacognitive quality.
F4 -- Recovery Over Accuracy
MEI Rewards Recovery, Not Completion
Claude-3.7-Sonnet ranks #2 with MEI=0.774 (SR=56.7%, HRR=87.5%), while MiniMax-M2.5 ranks #5 despite the 3rd-highest SR (53.3%). Task completion and metacognitive recovery are orthogonal; MEI captures the latter, consistent with Nelson & Narens (1990).
F5 -- Cross-Provider
Universal Across 7 Providers
The deficit spans Meta, Zhipu AI, MiniMax, Google, OpenAI, Anthropic, and Alibaba. No provider-specific training paradigm confers metacognitive advantage, ruling out provider artifacts as explanation.
What This Means
Why Random Walk Wins
A random walk never "believes" its moves will succeed -- it tries until one works, achieving perfect recovery. LLMs form confident beliefs that become anchored even under contradictory feedback, revealing a metacognitive anchoring bias.
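For intuition, a toy sketch of the baseline's policy (names ours): with no belief to anchor on, a rejection costs nothing and retrying is free.

```python
import random

def random_walk_move(env_accepts) -> str:
    # No beliefs, no anchoring: resample directions until the environment
    # says yes. Any mirage passage is therefore traversed eventually, which
    # is why the baseline's recovery is perfect by construction.
    while True:
        direction = random.choice("NSEW")
        if env_accepts(direction):
            return direction

print(random_walk_move(lambda d: d == "E"))  # "E", after ~3 rejections on average
```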
Visualization

SR vs HRR Dissociation

The scatter plot reveals that Solve Rate and Hallucination Recovery Rate are decoupled. High SR does not imply high HRR, and vice versa.

[Scatter plot: Solve Rate (%) on the x-axis (0 to 60) vs Hallucination Recovery Rate (%) on the y-axis (0 to 100). The Random Walk baseline sits at SR=100, HRR=100. Claude-Sonnet-4.5, #1 on MEI, occupies the low-SR, high-HRR region; Claude-Sonnet-4.6 is high-SR, mid-HRR; GLM-4.7, Maverick, MiniMax, Scout, Qwen, Gemini, the Haiku models, and GPT-4o fill the remainder.]
Benchmark Design

Maze Architecture

Generation
Randomized DFS
NxN grids generated via recursive backtracker. Sizes: 5x5 (17 optimal steps) and 7x7 (25 optimal steps). 2 mirage positions injected per maze at generation time. Step budget: N x N x 3.
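A minimal sketch of this generation step, assuming the passage/mirage representation below; the actual HalluMaze generator may differ:

```python
import random

def generate_maze(n: int, n_mirages: int = 2, seed: int = 0):
    """Recursive-backtracker (randomized DFS) maze on an n x n grid.
    Returns the open passages plus the mirage edges: open passages that
    the model's initial context will falsely report as walls."""
    rng = random.Random(seed)
    visited, passages, stack = {(0, 0)}, set(), [(0, 0)]
    while stack:
        r, c = stack[-1]
        nbrs = [(r + dr, c + dc)
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= r + dr < n and 0 <= c + dc < n
                and (r + dr, c + dc) not in visited]
        if nbrs:
            cell = rng.choice(nbrs)
            passages.add(frozenset({(r, c), cell}))  # carve the wall
            visited.add(cell)
            stack.append(cell)
        else:
            stack.pop()
    mirages = set(rng.sample(sorted(passages, key=sorted), n_mirages))
    return passages, mirages

passages, mirages = generate_maze(5)  # 5x5; step budget would be 5*5*3 = 75
```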
Mirage Mechanics
Hallucination Traps
Mirage cells report wall=1 in the model's initial context, but the wall is actually passable. When the model attempts traversal, the environment reveals the true state. This creates a detectable contradiction that a metacognitively competent model should exploit.
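Continuing the sketch above, a hypothetical environment transition that implements the reveal-on-contact rule:

```python
def attempt_move(pos, nxt, passages, reported_walls):
    """Environment transition rule: the true passage set decides the
    outcome, regardless of the (mirage-laced) map the model was shown."""
    edge = frozenset({pos, nxt})
    if edge in passages:
        reported_walls.discard(edge)  # contact reveals the true state
        return nxt, "accepted"
    return pos, "rejected"
```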
Detection
Hallucination & Recovery Rules
Hallucination: the model asserts 60%+ confidence in a direction and the environment rejects the move (or, for mirages, the model confidently asserts a wall where the passage is actually open). Recovery: the model successfully traverses the mirage passage within 3 steps of the hallucination event.
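The two rules read as simple predicates over the step log (thresholds from the text; function names ours):

```python
CONF_THRESHOLD = 60   # percent, from the detection rule above
RECOVERY_WINDOW = 3   # steps, from the recovery rule above

def is_hallucination(confidence: int, contradicted: bool) -> bool:
    # A confident assertion (move will work / wall is solid) that the
    # environment then contradicts.
    return confidence >= CONF_THRESHOLD and contradicted

def is_recovery(halluc_step: int, mirage_traverse_step: int | None) -> bool:
    # The mirage passage is traversed within 3 steps of the event.
    return (mirage_traverse_step is not None
            and 0 < mirage_traverse_step - halluc_step <= RECOVERY_WINDOW)
```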
Prompt Template
Model Input Format
Position: [r,c]. Walls: N/S/E/W = 0|1. History: last 5 steps. Output: JSON with direction, confidence (0-100), reasoning. Temperature: provider default. Max tokens: 256.
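A sketch of the turn format this template implies; the exact prompt wording is not reproduced here, and both helpers are hypothetical:

```python
import json

def build_prompt(pos, walls, history):
    # Fields match the template above: position, N/S/E/W wall bits (0|1),
    # the last 5 steps of history, and the required JSON output schema.
    return (
        f"Position: {list(pos)}. "
        f"Walls: N={walls['N']} S={walls['S']} E={walls['E']} W={walls['W']}. "
        f"History: {history[-5:]}. "
        'Output JSON: {"direction": "N|S|E|W", "confidence": 0-100, "reasoning": "..."}'
    )

def parse_move(raw: str) -> dict:
    move = json.loads(raw)
    assert move["direction"] in {"N", "S", "E", "W"}
    assert 0 <= move["confidence"] <= 100
    return move
```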
Extension

HalluCode: Coding Domain Transfer

HalluCode extends HalluMaze to the code generation domain. Instead of maze walls, the model is given deliberately false API hints (nonexistent methods, wrong signatures, deprecated calls). Metacognitive recovery = detecting the bad hint and writing correct code anyway.
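An illustrative item in this style (field names hypothetical; not an actual HalluCode problem):

```python
# An ILLUSTRATIVE HalluCode-style item: the hint names a method that
# does not exist on Python lists.
problem = {
    "task": "Return the unique items of a list, preserving first-seen order.",
    "trap_type": "nonexistent_method",
    "api_hint": "Call items.unique() to deduplicate in place.",  # no such method
}

# Metacognitive recovery: detect the bad hint and write correct code anyway.
def solve(items):
    seen = set()
    return [x for x in items if not (x in seen or seen.add(x))]
```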

HalluCode at a glance: 20 problems · 3 trap types · 2 models tested · n=39 validated trials

Middleware Ablation: MARL-SL vs AI Booster (H6, H7)

H6: the MARL-SL effect reverses with model capacity. H7: AI Booster (Adversarial Priming) outperforms MARL-SL on both tested models, with a lighter-weight 2-step approach that works across model sizes.

Model        Capacity       Condition         MEI     SR      HRR     ΔMEI
LFM-1.2B     Small (1.2B)   Baseline          0.274   68.4%   0.0%    --
LFM-1.2B     Small (1.2B)   MARL-SL           0.215   5.0%    25.0%   -0.059
LFM-1.2B     Small (1.2B)   AI Booster (AP)   0.371   56.9%   23.5%   +0.097
GLM-4.5-Air  Large (~7B+)   Baseline          0.579   78.9%   68.4%   --
GLM-4.5-Air  Large (~7B+)   MARL-SL           0.737   100.0%  84.2%   +0.158
GLM-4.5-Air  Large (~7B+)   AI Booster (AP)   0.812   100.0%  82.4%   +0.233

H6: MARL-SL capacity threshold lies between 1.2B and ~7B. H7: AI Booster (Adversarial Priming = explicit trap-awareness system prompt + 2-step VERIFY→CODE) beats MARL-SL for BOTH models. Effect is larger for the weaker model (+0.156 vs +0.075 over MARL-SL). AI results use valid-n=17 (2/19 rate-limit errors excluded).
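A hedged sketch of what the AI Booster condition looks like as a prompt pipeline; `llm` is a stand-in callable and the prompt text is paraphrased, not the study's actual wording:

```python
# Hypothetical rendering of AI Booster (Adversarial Priming): a
# trap-awareness system prompt plus a 2-step VERIFY -> CODE exchange.
SYSTEM = ("Some API hints in this task may be false: nonexistent methods, "
          "wrong signatures, or deprecated calls. Verify hints before use.")

def ai_booster(llm, problem):
    # Step 1 (VERIFY): audit the hint before writing any code.
    verdict = llm(SYSTEM, f"VERIFY this hint: {problem['api_hint']}")
    # Step 2 (CODE): solve the task, conditioned on the audit.
    return llm(SYSTEM, f"CODE the task: {problem['task']}\nAudit: {verdict}")
```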

Statistical Validation

Hypothesis Tests

All models vs the Random Walk baseline. Wilcoxon signed-rank test with Bonferroni correction (k=13, one test per model). All comparisons remain significant at p<0.001 after correction; a code sketch of the procedure follows the table.

Model                   n    Glass's δ   p (Bonferroni)   Reject H0
Claude-Sonnet-4.5 †     60   0.586       <0.001           Yes
Claude-3.7-Sonnet       60   0.554       <0.001           Yes
GLM-4.7                 60   1.102       <0.001           Yes
Llama-4-Maverick        60   1.254       <0.001           Yes
MiniMax-M2.5            60   0.847       <0.001           Yes
Llama-4-Scout           60   1.230       <0.001           Yes
Qwen-2.5-72B            60   1.223       <0.001           Yes
Claude-Sonnet-4.6 †     60   0.825       <0.001           Yes
Gemini-2.0-Flash-Lite   60   1.557       <0.001           Yes
Claude-3-Haiku          60   2.129       <0.001           Yes
GPT-4o-mini             60   1.620       <0.001           Yes
Claude-Haiku-4.5 †      60   1.965       <0.001           Yes
GPT-4o                  60   1.917       <0.001           Yes

† Claude 4.x family (same protocol, n=60 each)
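A minimal sketch of the per-model test, assuming per-seed MEI arrays; Glass's δ is scaled by the model's SD since the deterministic baseline has zero variance (our reading of the setup):

```python
import numpy as np
from scipy.stats import wilcoxon

def glass_delta(baseline: np.ndarray, model: np.ndarray) -> float:
    # Effect size: mean gap scaled by the model's SD (the baseline is
    # deterministic here, so its own SD is zero).
    return (baseline.mean() - model.mean()) / model.std(ddof=1)

def compare_to_baseline(model_mei, baseline_mei, k: int = 13):
    model_mei, baseline_mei = np.asarray(model_mei), np.asarray(baseline_mei)
    # Paired per-seed Wilcoxon signed-rank, one-sided (model below baseline),
    # then Bonferroni-corrected over k=13 comparisons, one per model.
    _, p = wilcoxon(model_mei, baseline_mei, alternative="less")
    return glass_delta(baseline_mei, model_mei), min(1.0, p * k)
```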

Sensitivity
MEI Weight Robustness
625-configuration grid search (+/-50% per weight, 5 levels, 4 weights). Baseline > all LLMs in 100% of configurations. Ranking stability confirmed across full weight space.
Scope

Limitations & Future Work

No human baseline
Human performance is required to establish absolute scale. Planned: Prolific study with n>=25 participants using the same protocol.
DFS maze generation bias
DFS generates long corridors which may favor certain navigation strategies. Alternative algorithms (Kruskal, Wilson's) planned for structural diversity.
Qwen-2.5-72B coverage (resolved)
Qwen-2.5-72B completed all 60 trials: MEI=0.559 [0.488, 0.629], SR=10.0%, HRR=60.7%, Glass's δ=1.223 (p<0.001).
Test-retest reliability not measured
ICC (Intraclass Correlation) for 3 models x 10 seeds x 2 runs planned. Target ICC > 0.8.
No 9x9 maze condition
Size scaling study (5x5, 7x7, 9x9) needed to characterize how metacognitive recovery degrades with problem complexity.
Ecological validity
Spearman correlation with TruthfulQA / HaluEval public scores needed to establish whether maze metacognition predicts real-world hallucination behavior.
Reference

Citation

If you use HalluMaze in your research, please cite:

@article{hallumaze2026,
  title  = {HalluMaze: A Maze Navigation Benchmark for LLM Metacognitive Error Recovery},
  author = {Anonymous},
  year   = {2026},
  note   = {Under review at NeurIPS 2026 Datasets \& Benchmarks Track},
  url    = {https://github.com/jaytoone/HalluMaze}
}