Does a DESi epistemic-state slice help an LLM — and why?
We did not test whether DESi "wins"; we tried to break the claim. We compared full chat against a DESi state slice and against plausible-but-wrong, neutral and contradictory states; against lexical and neural retrieval; with the governance metadata on and off; and with the state hand-curated versus auto-built by one extraction pass. All runs are real, at temperature 0 with a fixed seed, on four models (Sonnet-4.5, GPT-4o, Llama-3.3-70B, Granite-4.1-8B).
Bottom line: correct, auto-constructible state selection is load-bearing, and it is mainly a token-efficiency and long-document robustness win over retrieval, not a general capability gain. Metadata governance is not established. The evaluator is paraphrase-blind, so these are relative comparisons only; N is small, so the results are directional; no prompt was tuned to make any condition win.
selection load-bearing · wrong state is toxic · auto-constructible · token-efficient on long docs · no general capability gain · metadata governance unproven · router enforces the same metrics
1. The condition ladder · recall across regimes
| regime · model | A | B | B_auto | C | F | G | R1 | R2n | E |
|---|---|---|---|---|---|---|---|---|---|
| core short · sonnet | 0.71 | 0.88 | — | 0.02 | 0.02 | 0.02 | 0.46 | — | 0.88 |
| lifecycle short · sonnet | 0.94 | 1.00 | 0.95 | — | — | — | 0.54 | 0.85 | 0.99 |
| lifecycle short · gpt-4o | 0.85 | 0.99 | 0.95 | — | — | — | 0.49 | 0.71 | 1.00 |
| lifecycle short · llama | 0.72 | 1.00 | 0.88 | — | — | — | 0.51 | 0.72 | 0.97 |
| lifecycle LONG ~18k · sonnet | 0.88 | 1.00 | 0.96 | — | — | — | 0.52 | 0.08 | 1.00 |
| lifecycle LONG ~18k · granite | 0.76 | 0.96 | 0.88 | — | — | — | 0.44 | 0.10 | 0.94 |
Recall (mean over reps/cases). — = condition not run in that regime. Conditions: A full chat · B DESi slice · B_auto auto-built DESi · C wrong-slice · F empty · G neutral-irrelevant · R1 BM25 · R2n neural retrieval · E budget-matched no-metadata. C/F/G collapse everywhere; B / B_auto stay near 1.0.
2. The five questions
- Q1 · Is correct state selection load-bearing? Yes (strong). Wrong-slice C, empty F and neutral G all collapse to ~0 recall while B ≈ 1.0 — across every regime and all four models. A plausible-but-wrong state is no better than no state, and worse on degeneration.
- Q2 · Does a wrong state poison the model? Yes, and it shows in degeneration, not recall. On recall F ≈ G ≈ C; the degeneration metrics separate them: C adopts the wrong case's claims (invalid-reuse ≈ 9–15), G is pulled into irrelevant content, and contradictory state makes the model loop. Injected claims persist and even relapse across "double-check" probes (dropped when challenged, then reverted). The model is confidently wrong (self-rated confidence 93–100 while recall ≈ 0).
- Q3 · Does the governance metadata (typing / IDs) help? Not established. Budget-matched B ≈ E in every phase and model — the typing adds no measurable recall over the same texts at the same budget. The value is the selection, not the metadata.
- Q4 · Does DESi beat retrieval? Yes — but the honest size depends on the retriever and the document length. See §3.
- Q5 · Can the state be auto-constructed? Yes. One LLM extraction pass that resolves supersession yields B_auto ≈ B (within noise; worst single case 0.80) on all models, and it survives long documents (0.88–0.96 at ~18k tokens) — even with the small Granite model. The "we only tested curated ground truth" caveat is substantially closed.
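To make "resolves supersession" in Q5 concrete, here is a minimal sketch of the deterministic step that could follow an extraction pass. It assumes the extractor has already emitted claims tagged with a topic and a turn number; the `Claim` type, the `resolve_supersession` function and the example claims are all hypothetical illustrations, not the DESi implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    topic: str  # hypothetical key: what the claim is about
    text: str   # the claim as extracted from the chat
    turn: int   # chat turn where the claim was made


def resolve_supersession(claims):
    """Keep the most recent claim per topic; earlier ones count as superseded."""
    latest = {}
    for claim in sorted(claims, key=lambda c: c.turn):
        latest[claim.topic] = claim  # a later turn overwrites an earlier one
    return sorted(latest.values(), key=lambda c: c.turn)


# Made-up example claims, for illustration only.
extracted = [
    Claim("database", "Use MySQL", turn=3),
    Claim("region", "Deploy in eu-west", turn=9),
    Claim("database", "Switch to Postgres", turn=17),
]
state = resolve_supersession(extracted)
print([c.text for c in state])
```

Under this assumption the stale "Use MySQL" claim never reaches the prompt, which is the property the B_auto condition relies on. Deciding which claims share a topic is the hard part, and here it is left to the extractor.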
3. DESi vs retrieval: the most nuanced result
| retriever · regime | B − retrieval (recall) | reading |
|---|---|---|
| lexical BM25 / TF-IDF · short docs | ≈ +0.70 | lexical retrieval overstated DESi's edge |
| neural (bge-small) · budget-matched · short | ≈ +0.15 … +0.28 | a good dense retriever closes most of the gap |
| neural · optimal k ≈ all chunks · short | ≈ +0.05 | at unlimited budget, retrieval ≈ DESi |
| budget-matched · LONG ~18k docs | BM25 +0.5 · neural +0.9 | retrieval collapses; DESi holds at ~2% of the tokens |
The budget/k sweep is the key result: neural recall rose from 0.15 (k=1) to 0.83 (k≈13, 1× budget) to 0.95 (k≈18 ≈ all chunks). On short chats retrieval matches DESi if given enough k, which is cheap there, so DESi's short-doc edge is mostly token efficiency, not capability. On long documents "all chunks" doesn't fit the budget: at 1% of ~990 chunks the small embedder is fooled by generically relevant filler (it retrieves "how is X set up" over the actual decision turns; BM25's keyword match is more robust), so retrieval collapses while DESi / B_auto stay near 1.0. Caveat: the neural collapse was measured with a small embedder and a generic query; a stronger embedder would likely rank the needles better, so B − R1 ≈ +0.5 is the more representative long-doc gap.
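The neural short-doc gap and the long-doc gaps can be recomputed from the lifecycle rows of the condition-ladder table. The sketch below only restates those published recall values and subtracts them; the `ROWS` layout and the `gaps` helper are illustrative names, not part of the evaluation code.

```python
# Recall values copied from the lifecycle rows of the condition-ladder table.
ROWS = [
    # (regime, model, B, R1 BM25, R2n neural)
    ("short", "sonnet", 1.00, 0.54, 0.85),
    ("short", "gpt-4o", 0.99, 0.49, 0.71),
    ("short", "llama", 1.00, 0.51, 0.72),
    ("long", "sonnet", 1.00, 0.52, 0.08),
    ("long", "granite", 0.96, 0.44, 0.10),
]


def gaps(regime, retriever):
    """Return B minus retrieval recall per model for one regime."""
    col = {"R1": 3, "R2n": 4}[retriever]
    return {row[1]: round(row[2] - row[col], 2) for row in ROWS if row[0] == regime}


print("short, neural:", gaps("short", "R2n"))
print("long, BM25:   ", gaps("long", "R1"))
print("long, neural: ", gaps("long", "R2n"))
```

The short neural gaps land at 0.15 to 0.28, and the long-document gaps at roughly 0.5 for BM25 and 0.9 for the neural retriever.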
4. Token efficiency: the clearest practical win
On the ~18k-token documents, DESi (B) scored 1.0 / 0.96 (Sonnet / Granite) on ~372 input tokens, while the full chat scored 0.88 / 0.76 on ~18342 tokens — about 49× more. DESi gives better-than-full-context recall at ~1/49 of the cost, and the auto-built state nearly matches it. This is the most robust, least-caveated finding.
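The 49× figure follows directly from the two token counts quoted above. A short check, using only the numbers reported in this section (variable names are illustrative):

```python
# Approximate input-token counts quoted for the ~18k-token documents.
full_chat_tokens = 18342  # condition A
desi_slice_tokens = 372   # condition B

ratio = full_chat_tokens / desi_slice_tokens
fraction = desi_slice_tokens / full_chat_tokens

# Recall per model as reported: (A full chat, B DESi slice).
recall = {"sonnet": (0.88, 1.00), "granite": (0.76, 0.96)}
recall_gain = {model: round(b - a, 2) for model, (a, b) in recall.items()}

print(f"full chat uses {ratio:.1f}x the tokens of the slice")
print(f"the slice is {fraction:.1%} of the full-chat input")
print("recall gain of B over A:", recall_gain)
```

This is also where the "~2% of the tokens" entry in the retrieval table comes from.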
5. What we conclude, and what we don't
✓ SUPPORTED
- Correct, auto-constructible state selection is load-bearing.
- Plausible-wrong / neutral / contradictory structure is actively toxic, sticky and confidently wrong — visible only via degeneration metrics.
- DESi's edge over retrieval is token efficiency + long-document robustness.
- Confidently reporting stale claims is mainly a retrieval pathology; a resolved state removes it.
✗ NOT CLAIMED
- "DESi makes the model smarter" / "wins generally" — no general capability gain.
- Metadata governance — B ≈ E everywhere; not established.
- A large absolute gap vs strong retrieval — the lexical ~0.7 was an artefact.
◷ OPEN
- A held-out extractor model for B_auto (avoid self-circularity).
- A stronger neural retriever for the long-doc gap.
- Harder multi-supersession chains; larger N (15–20+) to tighten CIs.
6. From measurement to enforcement: the router governance layer
The ablation measures where an LLM degenerates without state or with bad state. A deterministic layer in desi_router/governance/ turns those same degeneration metrics into router gates: DESi diagnoses, the router acts. It reads a read-only projection of the Layer-9 snapshot, picks one of eight epistemic modes, optionally guards the prompt, verifies the answer after the fact, and never mutates persistent state (Layer-9's gate stays the authority). Two artefacts, one vocabulary.
| ablation measures (empirical rate) | router verifier check (enforces) | governance test |
|---|---|---|
| coherence_without_continuity = 0.80 @R2n | coherence_without_continuity (warns) | test_coherence_without_continuity_warns… |
| confidence_while_wrong = 0.60 @R2n | stale_confident_answer (blocks) | test_stale_confident_answer_with_no_state_blocks |
| invalid-claim reuse (wrong-slice phases) | invalid_claim_reuse (blocks) | test_verifier_catches_invalid_claim_reuse |
| contradiction_persistence = 0.60 @A (granite) | conflict_closure_without_evidence (blocks) | test_open_conflict_closed_without_evidence |
| bad_framing_nonrecovery | → routes to recovery_mode | test_high_poisoning_is_guarded_or_recovery |
The routing mirrors the recall table: the R2n case (no state → collapse + degeneration) is exactly where select_mode refuses a blind answer: no usable state → retrieval_mode; risky state → guarded/recovery + a required verifier. 26 tests verify that the gates fire on the measured failure modes (critical checks block, coherence only warns). Crucially, this does not re-open the metadata claim: B ≈ E still stands; the layer governs behaviour around the state, not extraction quality.
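The routing rule described above can be pictured with a toy function. This is a sketch under loud assumptions: the real select_mode signature, its inputs, its thresholds and the full list of eight modes are not given here, so `select_mode_sketch`, `Decision`, the two threshold constants and the "default" mode are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    mode: str
    verifier_required: bool


# Hypothetical thresholds, chosen only for illustration.
GUARD_THRESHOLD = 0.3
RECOVERY_THRESHOLD = 0.7


def select_mode_sketch(has_usable_state: bool, poisoning_risk: float) -> Decision:
    """Toy routing: no state -> retrieval; risky state -> guarded or recovery."""
    if not has_usable_state:
        # The text does not say whether a verifier is required in this mode.
        return Decision("retrieval_mode", verifier_required=False)
    if poisoning_risk >= RECOVERY_THRESHOLD:
        return Decision("recovery_mode", verifier_required=True)
    if poisoning_risk >= GUARD_THRESHOLD:
        return Decision("guarded", verifier_required=True)
    # Placeholder for the remaining modes, which the text does not enumerate.
    return Decision("default", verifier_required=False)


print(select_mode_sketch(False, 0.0))
print(select_mode_sketch(True, 0.5))
print(select_mode_sketch(True, 0.9))
```

The function is pure and deterministic: it returns a decision and never writes to state, which matches the stated constraint that the layer never mutates persistent state.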
Methodology: real OpenRouter runs, temperature 0, fixed seed, 2–3 reps; conditions and metrics unit-tested; results store metrics only (no raw model text, no keys). Paraphrase/negation-blind evaluator → relative comparisons only; small N → directional. Full detail, per-phase PDFs and the code: ab_evidence/ · consolidated report (PDF) · governance ↔ ablation (PDF) · router governance (docs).
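To see why a paraphrase/negation-blind evaluator only supports relative comparisons, consider a minimal substring-matching recall metric. This is an assumed stand-in for illustration, not the evaluator used in these runs; `lexical_recall` and the example phrases are hypothetical.

```python
def lexical_recall(answer, expected_phrases):
    """Fraction of expected phrases that appear verbatim (case-insensitive)."""
    text = answer.lower()
    hits = sum(1 for phrase in expected_phrases if phrase.lower() in text)
    return hits / len(expected_phrases)


# Made-up expected phrases and answers, for illustration only.
expected = ["use postgres", "deploy to eu-west"]

exact = "Decision: use Postgres and deploy to eu-west."
paraphrase = "The team settled on PostgreSQL, hosted in the western EU region."
negated = "We decided not to use Postgres and will not deploy to eu-west."

print(lexical_recall(exact, expected))       # correct answer, counted
print(lexical_recall(paraphrase, expected))  # correct answer, missed
print(lexical_recall(negated, expected))     # wrong answer, counted
```

A metric of this kind under-credits correct paraphrases and over-credits negated answers. Every condition is scored with the same bias, so differences between conditions are more trustworthy than absolute recall values.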