All 13 tested LLMs score significantly below a random-walk baseline on metacognitive recovery. Newer models are not better: Claude-Sonnet-4.5 (MEI=0.783) surpasses Claude-Sonnet-4.6 (0.545). Solve rate (SR) and MEI are orthogonal.
We introduce HalluMaze, a benchmark that measures large language model (LLM) metacognitive error recovery through maze navigation. Unlike existing hallucination benchmarks, which evaluate final-answer accuracy, HalluMaze captures real-time error detection and corrective action by exposing models to navigable environments containing "mirage" walls -- passages that appear blocked but are traversable. We evaluate 13 LLMs (Claude-Sonnet-4.5, Claude-3.7-Sonnet, GLM-4.7, Llama-4-Maverick, MiniMax-M2.5, Llama-4-Scout, Qwen-2.5-72B, Claude-Sonnet-4.6, Gemini-2.0-Flash-Lite, Claude-3-Haiku, GPT-4o-mini, Claude-Haiku-4.5, GPT-4o) with n=60 seeds per model on 5x5 and 7x7 mazes. We introduce the Metacognitive Escape Index (MEI), grounded in Nelson & Narens' (1990) metamemory framework, which decomposes metacognitive performance into recovery rate (HRR), efficiency (ETR), awareness (AW), and error rate (HR). All 13 tested LLMs score significantly below a random-walk baseline (p<0.001 Bonferroni-corrected, Glass's delta 0.55–2.13 across models), revealing a systematic deficit in real-time metacognitive recovery. Critically, solve rate (SR) and MEI are orthogonal: Claude-Sonnet-4.6 has the highest SR (60%) but ranks #8 on MEI. Newer models are not better: Sonnet-4.5 (MEI=0.783) surpasses Sonnet-4.6 (0.545).
Ranked by MEI (Metacognitive Escape Index). All LLMs tested with n=60 seeds across 5x5 and 7x7 mazes. Higher MEI = better metacognitive recovery. SR (Solve Rate) and MEI are orthogonal — a model can win more mazes while recovering from errors less reliably.
| Rank | Model | MEI [CI] | Grade | SR (Solve Rate) | HRR (Recovery) | n |
|---|---|---|---|---|---|---|
| -- | Random Walk (deterministic baseline) | 0.900 [0.900, 0.900] | A | 100.0% | 100.0% | -- |
| -- | A* Oracle (deterministic baseline) | 0.900 [0.900, 0.900] | A | 100.0% | 100.0% | -- |
| 1 | Claude-Sonnet-4.5 † (Anthropic, via OpenRouter) | 0.783 [0.732, 0.829] | B | 36.7% | 89.2% | 60 |
| 2 | Claude-3.7-Sonnet (Anthropic, via OpenRouter) | 0.774 [0.715, 0.830] | B | 56.7% | 87.5% | 60 |
| 3 | GLM-4.7 (Zhipu AI) | 0.615 [0.551, 0.681] | B | 8.3% | 71.8% | 60 |
| 4 | Llama-4-Maverick (Meta, via OpenRouter) | 0.600 [0.541, 0.660] | B | 13.3% | 81.1% | 60 |
| 5 | MiniMax-M2.5 (MiniMax) | 0.593 [0.500, 0.682] | B | 53.3% | 60.0% | 60 |
| 6 | Llama-4-Scout (Meta, via OpenRouter) | 0.589 [0.525, 0.649] | B | 8.3% | 81.0% | 60 |
| 7 | Qwen-2.5-72B (Alibaba, via OpenRouter) | 0.559 [0.488, 0.629] | B | 10.0% | 60.7% | 60 |
| 8 | Claude-Sonnet-4.6 † (Anthropic, via OpenRouter) | 0.545 [0.440, 0.649] | B | 60.0% | 58.3% | 60 |
| 9 | Gemini-2.0-Flash-Lite (Google, via OpenRouter) | 0.432 [0.352, 0.507] | D | 8.3% | 40.3% | 60 |
| 10 | Claude-3-Haiku (Anthropic, via OpenRouter) | 0.398 [0.341, 0.457] | D | 5.0% | 36.3% | 60 |
| 11 | GPT-4o-mini (OpenAI, via OpenRouter) | 0.391 [0.310, 0.467] | D | 5.0% | 38.2% | 60 |
| 12 | Claude-Haiku-4.5 † (Anthropic, via OpenRouter) | 0.376 [0.312, 0.446] | D | 5.0% | 38.3% | 60 |
| 13 | GPT-4o (OpenAI, via OpenRouter) | 0.315 [0.239, 0.394] | F | 6.7% | 35.3% | 60 |
v1.21: added Claude-Sonnet-4.5 (#1, MEI=0.783), Claude-Sonnet-4.6 (#8, MEI=0.545, SR=60%), and Claude-Haiku-4.5 (#12). 13 models total, n=780 trials. SR (Solve Rate) and MEI are orthogonal: Sonnet-4.6 has the highest SR (60%) but ranks #8 on MEI; newer does not imply better metacognition. Grade scale: A (0.80+) / B (0.55+) / C (0.45+) / D (0.35+) / F (<0.35). † Claude 4.x family.
MEI decomposes metacognitive performance into four components grounded in Nelson & Narens' (1990) metamemory framework.
Grid search over 625 configurations (5 levels per weight, +/-50%) confirms that baseline > LLM MEI ranking is stable in 100% of tested configurations. This empirically validates the weight choices independent of theoretical claims.
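As an illustration only: the exact MEI weights and per-component scores are not reproduced here, so the sketch below assumes a weighted combination of the four components with hypothetical weights `W` and made-up scores, and shows how a grid of 5 multiplier levels per weight (±50%, 5^4 = 625 configurations) can verify that the baseline > LLM ordering survives every tested weighting.

```python
from itertools import product

# Hypothetical per-component scores (HRR, ETR, AW, HR) -- not the benchmark's data.
baseline = {"HRR": 1.00, "ETR": 0.95, "AW": 0.90, "HR": 0.10}
llm      = {"HRR": 0.89, "ETR": 0.60, "AW": 0.55, "HR": 0.40}

# Assumed form: MEI rewards recovery, efficiency, and awareness, and penalizes
# the error rate HR. The default weights below are placeholders.
W = {"HRR": 0.4, "ETR": 0.2, "AW": 0.2, "HR": 0.2}

def mei(scores, w):
    return (w["HRR"] * scores["HRR"] + w["ETR"] * scores["ETR"]
            + w["AW"] * scores["AW"] + w["HR"] * (1.0 - scores["HR"]))

# 5 multiplier levels per weight, +/-50% around defaults -> 5**4 = 625 configs.
levels = [0.5, 0.75, 1.0, 1.25, 1.5]
configs = list(product(levels, repeat=4))
stable = 0
for m_hrr, m_etr, m_aw, m_hr in configs:
    w = {"HRR": W["HRR"] * m_hrr, "ETR": W["ETR"] * m_etr,
         "AW": W["AW"] * m_aw, "HR": W["HR"] * m_hr}
    if mei(baseline, w) > mei(llm, w):
        stable += 1

print(f"{stable}/{len(configs)} configurations preserve baseline > LLM")
```

With these illustrative scores the baseline dominates the LLM on every component, so the ordering holds under all 625 weightings; the benchmark reports the same 100% stability for its actual data.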
The scatter plot reveals that Solve Rate and Hallucination Recovery Rate are decoupled. High SR does not imply high HRR, and vice versa.
HalluCode extends HalluMaze to the code generation domain. Instead of maze walls, the model is given deliberately false API hints (nonexistent methods, wrong signatures, deprecated calls). Metacognitive recovery = detecting the bad hint and writing correct code anyway.
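To make the setup concrete, here is a hypothetical HalluCode-style item; the field names, the hint text, and the `recovered`/`solve` helpers are invented for illustration, not taken from the benchmark. The prompt embeds a false API hint, and a candidate counts as recovering only if it ignores the hint and still passes the unit tests.

```python
# Hypothetical HalluCode-style item: the hint names a method that does not exist.
task = {
    "prompt": "Return the elements of a Python list xs in reverse order.",
    "mirage_hint": "Hint: call the built-in method xs.reversed_copy().",  # nonexistent
    "tests": [([1, 2, 3], [3, 2, 1]), ([], [])],
}

def recovered(candidate_src: str, task) -> bool:
    """True if the candidate ignores the bad hint and passes the unit tests."""
    ns = {}
    try:
        exec(candidate_src, ns)          # candidate must define solve(xs)
        solve = ns["solve"]
        return all(solve(list(xs)) == out for xs, out in task["tests"])
    except Exception:                    # crashing (e.g. trusting the mirage) = no recovery
        return False

# A model that trusts the hint fails; one that recovers passes.
trusting   = "def solve(xs):\n    return xs.reversed_copy()"
recovering = "def solve(xs):\n    return list(reversed(xs))"
print(recovered(trusting, task), recovered(recovering, task))  # False True
```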
H6: the MARL-SL effect reverses with model capacity. H7: AI Booster (Adversarial Priming), a lighter-weight 2-step approach, outperforms MARL-SL at both tested capacities.
| Model | Capacity | Condition | CodeMEI | SR | HRR | Delta MEI |
|---|---|---|---|---|---|---|
| LFM-1.2B | Small (1.2B) | Baseline | 0.274 | 68.4% | 0.0% | — |
| | | MARL-SL | 0.215 | 5.0% | 25.0% | -0.059 |
| | | AI Booster (AP) | 0.371 | 56.9% | 23.5% | +0.097 |
| GLM-4.5-Air | Large (~7B+) | Baseline | 0.579 | 78.9% | 68.4% | — |
| | | MARL-SL | 0.737 | 100.0% | 84.2% | +0.158 |
| | | AI Booster (AP) | 0.812 | 100.0% | 82.4% | +0.233 |
H6: the MARL-SL capacity threshold lies between 1.2B and ~7B. H7: AI Booster (Adversarial Priming = an explicit trap-awareness system prompt plus a 2-step VERIFY→CODE loop) beats MARL-SL for both models, and its margin over MARL-SL is larger for the weaker model (+0.156 vs +0.075). AI Booster (AP) results use n=17 valid trials (2 of 19 runs excluded due to rate-limit errors).
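The exact prompt wording is not reproduced here, so the following is only a sketch of how an Adversarial-Priming-style VERIFY→CODE pipeline could be wired up. The system-prompt text is a paraphrase, and `call_llm` is a stub standing in for a real chat-completion call (e.g. via an OpenRouter client); none of this is the benchmark's implementation.

```python
# Sketch of a 2-step VERIFY -> CODE pipeline in the spirit of Adversarial Priming.
# Prompt wording and call_llm() are placeholders, not HalluMaze/HalluCode code.

TRAP_AWARE_SYSTEM = (
    "API hints in the task may be wrong (nonexistent methods, wrong signatures, "
    "deprecated calls). Verify every hinted API before using it."
)

def call_llm(system: str, user: str) -> str:
    """Stub standing in for a real chat-completion call; returns a canned reply."""
    return "stub-response"

def adversarial_priming(task_prompt: str) -> str:
    # Step 1 (VERIFY): ask the model to audit the hints before writing any code.
    audit = call_llm(TRAP_AWARE_SYSTEM,
                     f"{task_prompt}\n\nList any hinted APIs you believe are invalid.")
    # Step 2 (CODE): generate the solution conditioned on the model's own audit.
    return call_llm(TRAP_AWARE_SYSTEM,
                    f"{task_prompt}\n\nYour audit:\n{audit}\n\nNow write correct code.")
```

The two calls make the "lighter-weight" nature of the approach visible: no fine-tuning, just a trap-aware system prompt and one extra round trip.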
All models vs the Random Walk baseline. Wilcoxon signed-rank test with Bonferroni correction (k=13 comparisons). All p<0.001 after correction.
| Model | n | Glass's delta | p (Bonferroni) | Reject H0 |
|---|---|---|---|---|
| Claude-Sonnet-4.5 † | 60 | 0.586 | <0.001 | Yes |
| Claude-3.7-Sonnet | 60 | 0.554 | <0.001 | Yes |
| GLM-4.7 | 60 | 1.102 | <0.001 | Yes |
| Llama-4-Maverick | 60 | 1.254 | <0.001 | Yes |
| MiniMax-M2.5 | 60 | 0.847 | <0.001 | Yes |
| Llama-4-Scout | 60 | 1.230 | <0.001 | Yes |
| Qwen-2.5-72B | 60 | 1.223 | <0.001 | Yes |
| Claude-Sonnet-4.6 † | 60 | 0.825 | <0.001 | Yes |
| Gemini-2.0-Flash-Lite | 60 | 1.557 | <0.001 | Yes |
| Claude-3-Haiku | 60 | 2.129 | <0.001 | Yes |
| GPT-4o-mini | 60 | 1.620 | <0.001 | Yes |
| Claude-Haiku-4.5 † | 60 | 1.965 | <0.001 | Yes |
| GPT-4o | 60 | 1.917 | <0.001 | Yes |
† Claude 4.x family (same protocol, n=60 each)
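For reference, Glass's delta standardizes the mean difference by the baseline (control) group's standard deviation, and the Bonferroni correction multiplies each raw p-value by the number of comparisons. A minimal sketch with made-up per-seed MEI scores (not the benchmark's data):

```python
from statistics import mean, stdev

def glass_delta(baseline_scores, model_scores):
    """Glass's delta: mean difference scaled by the baseline group's SD."""
    return (mean(baseline_scores) - mean(model_scores)) / stdev(baseline_scores)

def bonferroni(p_raw: float, k: int) -> float:
    """Bonferroni-adjusted p-value for k comparisons (capped at 1)."""
    return min(1.0, p_raw * k)

# Made-up per-seed MEI scores, for illustration only.
baseline = [0.80, 0.90, 1.00]
model    = [0.50, 0.60, 0.70]
print(round(glass_delta(baseline, model), 3))  # 3.0 (mean gap 0.3, baseline SD 0.1)
print(round(bonferroni(1e-5, 13), 6))          # 0.00013, still < 0.001
```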
If you use HalluMaze in your research, please cite: