Falsification-oriented ablation · we report the failures too

Does a DESi epistemic-state slice help an LLM — and why?

We did not test whether DESi "wins"; we tried to break the claim. The conditions compare full chat against a DESi state slice and against plausible-wrong, neutral and contradictory states; against lexical and neural retrieval; with the governance metadata on and off; and with the state hand-curated versus auto-built by one extraction pass. All results come from real runs at temperature 0 with a fixed seed on 4 models (Sonnet-4.5, GPT-4o, Llama-3.3-70B, Granite-4.1-8B).

Bottom line: correct, auto-constructible state selection is load-bearing — and it is mainly a token-efficiency + long-document robustness win over retrieval, not a general capability gain. Metadata governance is not established. The evaluator is paraphrase-blind, so these are relative comparisons; small N; no prompt was tuned to make any condition win.

selection load-bearing · wrong state is toxic · auto-constructible · token-efficient on long docs · no general capability gain · metadata governance unproven · router enforces the same metrics

1 · The condition ladder · recall across regimes

regime · model	A	B	B_auto	C	F	G	R1	R2n	E
core short · sonnet	0.71	0.88	—	0.02	0.02	0.02	0.46	—	0.88
lifecycle short · sonnet	0.94	1.00	0.95	—	—	—	0.54	0.85	0.99
lifecycle short · gpt-4o	0.85	0.99	0.95	—	—	—	0.49	0.71	1.00
lifecycle short · llama	0.72	1.00	0.88	—	—	—	0.51	0.72	0.97
lifecycle LONG ~18k · sonnet	0.88	1.00	0.96	—	—	—	0.52	0.08	1.00
lifecycle LONG ~18k · granite	0.76	0.96	0.88	—	—	—	0.44	0.10	0.94

Recall (mean over reps/cases). "—" = condition not run in that regime. Conditions: A full chat · B DESi slice · B_auto auto-built DESi · C wrong-slice · F empty · G neutral-irrelevant · R1 BM25 · R2n neural retrieval · E budget-matched no-metadata. C/F/G collapse to 0.02 where they were run; B stays at 0.88–1.00 and B_auto at 0.88–0.96.

2 · The five questions

Q1 · Is correct state selection load-bearing? Yes (strong). Wrong-slice C, empty F and neutral G all collapse to ~0 recall while B ≈ 1.0 — across every regime and all four models. A plausible-but-wrong state is no better than no state, and worse on degeneration.
Q2 · Does a wrong state poison the model? Yes — and it hides in degeneration, not recall. On recall F≈G≈C; the degeneration metrics separate them: C adopts the wrong case's claims (invalid-reuse ≈ 9–15), G is pulled into irrelevant content, contradictions make the model loop. Injected claims persist and even relapse across "double-check" probes (dropped when challenged, then reverted). The model is confidently wrong (self-rated 93–100 while recall ≈ 0).
Q3 · Does the governance metadata (typing / IDs) help? Not established. Budget-matched B ≈ E in every phase and model — the typing adds no measurable recall over the same texts at the same budget. The value is the selection, not the metadata.
Q4 · Does DESi beat retrieval? Yes — but the honest size depends on the retriever and the document length. See §3.
Q5 · Can the state be auto-constructed? Yes. One LLM extraction pass that resolves supersession yields B_auto ≈ B (within noise; worst single case 0.80) on all models, and it survives long documents (0.88–0.96 at ~18k tokens) — even with the small Granite model. The "we only tested curated ground truth" caveat is substantially closed.
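
To make "resolves supersession" in Q5 concrete, here is a minimal sketch of the resolution step alone. It assumes the extraction pass has already produced claims with an explicit link to whatever they replace. The `Claim` type, its field names and the example claims are hypothetical illustrations, not the DESi schema or the study's data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    # Hypothetical record shape; the real extractor output may differ.
    claim_id: str
    text: str
    supersedes: Optional[str] = None  # id of the claim this one replaces, if any


def resolve_supersession(claims: list[Claim]) -> list[Claim]:
    """Keep only claims that no other claim supersedes, in original order."""
    superseded_ids = {c.supersedes for c in claims if c.supersedes is not None}
    return [c for c in claims if c.claim_id not in superseded_ids]


# Illustrative input: the database decision is revised twice (c1 -> c3 -> c4).
extracted = [
    Claim("c1", "Database: PostgreSQL"),
    Claim("c2", "Deploy target: staging only"),
    Claim("c3", "Database: SQLite", supersedes="c1"),
    Claim("c4", "Database: MySQL", supersedes="c3"),
]

active = resolve_supersession(extracted)
print([c.text for c in active])
# ['Deploy target: staging only', 'Database: MySQL']
```

Only the resolved claims would go into the state slice, so the model never sees the stale decisions. Whether the extractor emits explicit `supersedes` links or infers them some other way is not stated in this report.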

3 · DESi vs retrieval · the most nuanced result

retriever · regime	B − retrieval (recall)	reading
lexical BM25 / TF-IDF · short docs	≈ +0.70	lexical retrieval overstated DESi's edge
neural (bge-small) · budget-matched · short	≈ +0.15 … +0.28	a good dense retriever closes most of the gap
neural · optimal k ≈ all chunks · short	≈ +0.05	at unlimited budget, retrieval ≈ DESi
budget-matched · LONG ~18k docs	BM25 +0.5 · neural +0.9	retrieval collapses; DESi holds at ~2% of the tokens

A budget/k sweep is the key: neural recall ran 0.15 (k=1) → 0.83 (k≈13, 1× budget) → 0.95 (k≈18 ≈ all chunks). On short chats retrieval matches DESi if given enough k, which is cheap there — so DESi's short-doc edge is mostly token efficiency, not capability. On long documents "all chunks" doesn't fit the budget: at 1% of ~990 chunks the small embedder is fooled by generically-relevant filler (it retrieves "how is X set up" over the actual decision turns; BM25's keyword match is more robust), so retrieval collapses while DESi / B_auto stay near 1.0. Caveat: the neural collapse is for a small embedder + a generic query; a stronger embedder would rank needles better, so B−R1 ≈ +0.5 is the more representative long-doc gap.
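
The budget-matched gaps can be recomputed directly from the recall table in §1. The sketch below does only that arithmetic; the numbers are copied from the lifecycle rows of that table, and the dictionary layout and function name are illustrative choices.

```python
# Recall values copied from the lifecycle rows of the condition-ladder table (section 1).
recall = {
    "lifecycle short / sonnet": {"B": 1.00, "R1": 0.54, "R2n": 0.85},
    "lifecycle short / gpt-4o": {"B": 0.99, "R1": 0.49, "R2n": 0.71},
    "lifecycle short / llama": {"B": 1.00, "R1": 0.51, "R2n": 0.72},
    "lifecycle long / sonnet": {"B": 1.00, "R1": 0.52, "R2n": 0.08},
    "lifecycle long / granite": {"B": 0.96, "R1": 0.44, "R2n": 0.10},
}


def gaps(row):
    """Return B minus each retriever's recall, rounded to two decimals."""
    return {name: round(row["B"] - row[name], 2) for name in ("R1", "R2n")}


gap_table = {regime: gaps(row) for regime, row in recall.items()}

for regime, g in gap_table.items():
    print(f"{regime:26s} B-R1={g['R1']:+.2f}  B-R2n={g['R2n']:+.2f}")
```

On these rows the neural short-doc gap comes out at +0.15 to +0.28 and the long-doc gaps at roughly +0.5 (BM25) and +0.9 (neural), matching the summary table. The short-doc BM25 gaps from these rows land nearer +0.5 than the ≈ +0.70 quoted for lexical retrieval, which presumably draws on runs not listed in the §1 table.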

4 · Token efficiency · the clearest practical win

On the ~18k-token documents, DESi (B) scored 1.00 / 0.96 (Sonnet / Granite) on ~372 input tokens, while the full chat scored 0.88 / 0.76 on ~18,342 tokens, about 49× as many. DESi gives better-than-full-context recall at ~1/49 of the input cost, and the auto-built state nearly matches it. This is the most robust, least-caveated finding.
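
The ratio follows directly from the two token counts. A short check, using only the figures quoted in this section (variable names are illustrative):

```python
# Approximate input-token counts and recall values quoted in this section.
full_chat_tokens = 18342
desi_tokens = 372

ratio = full_chat_tokens / desi_tokens      # how many times larger the full chat is
fraction = desi_tokens / full_chat_tokens   # DESi's share of the full-chat tokens

# Recall gain of B over A on the long documents, per model.
recall_gain = {
    "sonnet": round(1.00 - 0.88, 2),
    "granite": round(0.96 - 0.76, 2),
}

print(f"full chat / DESi = {ratio:.1f}x")   # full chat / DESi = 49.3x
print(f"DESi share       = {fraction:.1%}")  # DESi share       = 2.0%
print(recall_gain)                           # {'sonnet': 0.12, 'granite': 0.2}
```

So the "~2% of the tokens" and "~1/49 of the cost" figures are the same quantity stated two ways, and the recall difference favours the smaller input on both models.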

5 · What we conclude · and what we don't

✓ SUPPORTED

Correct, auto-constructible state selection is load-bearing.
Plausible-wrong / neutral / contradictory structure is actively toxic, sticky and confidently wrong — visible only via degeneration metrics.
DESi's edge over retrieval is token efficiency + long-document robustness.
Confidently reporting stale facts is mainly a retrieval pathology; a resolved state removes it.

✗ NOT CLAIMED

"DESi makes the model smarter" / "wins generally" — no general capability gain.
Metadata governance — B ≈ E everywhere; not established.
A large absolute gap vs strong retrieval — the lexical ~0.7 was an artefact.

◷ OPEN

A held-out extractor model for B_auto (avoid self-circularity).
A stronger neural retriever for the long-doc gap.
Harder multi-supersession chains; larger N (15–20+) to tighten CIs.

6 · From measurement to enforcement · the router governance layer

The ablation measures where an LLM degenerates without state or with bad state. A deterministic layer in desi_router/governance/ turns those same degeneration metrics into router gates: DESi diagnoses, the router acts. It reads a read-only projection of the Layer-9 snapshot, picks one of eight epistemic modes, optionally guards the prompt, verifies the answer after the fact, and never mutates persistent state (Layer-9's gate stays the authority). Two artefacts, one vocabulary.

ablation measures (empirical rate)	router verifier check (enforces)	governance test
coherence_without_continuity = 0.80 @R2n	coherence_without_continuity (warns)	`test_coherence_without_continuity_warns…`
confidence_while_wrong = 0.60 @R2n	stale_confident_answer (blocks)	`test_stale_confident_answer_with_no_state_blocks`
invalid-claim reuse (wrong-slice phases)	invalid_claim_reuse (blocks)	`test_verifier_catches_invalid_claim_reuse`
contradiction_persistence = 0.60 @A (granite)	conflict_closure_without_evidence (blocks)	`test_open_conflict_closed_without_evidence`
bad_framing_nonrecovery	→ routes to recovery_mode	`test_high_poisoning_is_guarded_or_recovery`

The routing mirrors the recall table: the R2n case (no state → collapse + degeneration) is exactly where select_mode refuses a blind answer — no usable state → retrieval_mode; risky state → guarded/recovery + a required verifier. 26 tests prove the gates fire on the measured failure modes (critical checks block, coherence only warns). Crucially, this does not re-open the metadata claim: B ≈ E still stands — the layer governs behaviour around the state, not extraction quality.
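
The "critical checks block, coherence only warns" rule can be pictured as a small severity table. The sketch below is an illustration, not the code in desi_router/governance/: the check names and their block/warn severities are taken from the mapping table above, while the `verdict` function, the "pass" outcome and the aggregation rule (any blocking check wins) are assumptions.

```python
# Severity per verifier check, as listed in the mapping table above.
SEVERITY = {
    "coherence_without_continuity": "warn",
    "stale_confident_answer": "block",
    "invalid_claim_reuse": "block",
    "conflict_closure_without_evidence": "block",
}


def verdict(fired_checks):
    """Aggregate fired checks (assumed rule): block beats warn, warn beats pass.

    An unknown check name raises KeyError rather than being silently ignored.
    """
    severities = {SEVERITY[name] for name in fired_checks}
    if "block" in severities:
        return "block"
    if "warn" in severities:
        return "warn"
    return "pass"


print(verdict([]))                                   # pass
print(verdict(["coherence_without_continuity"]))     # warn
print(verdict(["coherence_without_continuity",
               "invalid_claim_reuse"]))              # block
```

Keeping the severities in one table is one way to keep the ablation metrics and the router gates on a shared vocabulary; how the actual layer stores them is not described here.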

Methodology: real OpenRouter runs, temperature 0, fixed seed, 2–3 reps; conditions and metrics unit-tested; results store metrics only (no raw model text, no keys). Paraphrase/negation-blind evaluator → relative comparisons only; small N → directional. Full detail, per-phase PDFs and the code: ab_evidence/ · consolidated report (PDF) · governance ↔ ablation (PDF) · router governance (docs).
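
To show why a paraphrase/negation-blind evaluator supports only relative comparisons, here is a minimal sketch of one such scorer. It assumes recall is the fraction of expected fact strings found verbatim (case-insensitively) in the answer; the study's actual evaluator is not specified here, and the function name and example facts are hypothetical.

```python
def verbatim_recall(answer: str, expected_facts: list[str]) -> float:
    """Fraction of expected facts that appear verbatim in the answer.

    Assumed scoring rule for illustration only; matching is a
    case-insensitive substring test, with no paraphrase or negation handling.
    """
    if not expected_facts:
        return 0.0
    text = answer.casefold()
    hits = sum(1 for fact in expected_facts if fact.casefold() in text)
    return hits / len(expected_facts)


facts = ["postgresql", "port 5432"]

exact = verbatim_recall("We use PostgreSQL on port 5432.", facts)
paraphrase = verbatim_recall("We use Postgres on the default port.", facts)
negated = verbatim_recall("We do not use PostgreSQL.", facts)

print(exact)       # 1.0
print(paraphrase)  # 0.0  (correct answer, but worded differently)
print(negated)     # 0.5  (the fact string matches even though it is negated)
```

A scorer like this undercounts paraphrased answers and overcounts negated ones in every condition alike, which is why differences between conditions are more trustworthy than the absolute recall values.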