DESi
Falsification-oriented ablation · we report the failures too

Does a DESi epistemic-state slice help an LLM — and why?

We did not test whether DESi "wins". We tried to break the claim: full chat vs a DESi state slice vs a plausible-wrong, neutral, contradictory state; vs lexical and neural retrieval; with the governance metadata on and off; and with the state hand-curated vs auto-built by one extraction pass. Real runs, temperature 0, fixed seed, 4 models (Sonnet-4.5, GPT-4o, Llama-3.3-70B, Granite-4.1-8B).

Bottom line: correct, auto-constructible state selection is load-bearing — and it is mainly a token-efficiency + long-document robustness win over retrieval, not a general capability gain. Metadata governance is not established. The evaluator is paraphrase-blind, so these are relative comparisons; small N; no prompt was tuned to make any condition win.
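To make "paraphrase-blind" concrete, here is a minimal sketch of a verbatim string-match recall scorer. The actual evaluator is not shown here, so the function name `recall`, the example facts and the example answers are all illustrative assumptions. The sketch only shows why scores from this kind of metric are safe to compare across conditions but not to read as absolute accuracy.

```python
def recall(answer: str, expected_facts: list[str]) -> float:
    """Fraction of expected fact strings found verbatim (case-insensitive)."""
    if not expected_facts:
        return 0.0
    text = answer.lower()
    hits = sum(1 for fact in expected_facts if fact.lower() in text)
    return hits / len(expected_facts)


# Hypothetical facts and answers, for illustration only.
facts = ["postgres", "weekly backups"]
verbatim = "We chose Postgres with weekly backups."
paraphrase = "We chose Postgres and back it up every week."
negated = "We did not choose Postgres; no weekly backups."

scores = {
    "verbatim": recall(verbatim, facts),      # both strings present
    "paraphrase": recall(paraphrase, facts),  # correct answer, one string missing
    "negated": recall(negated, facts),        # wrong answer, both strings present
}
```

A correct paraphrase is under-credited and a negated answer is over-credited. Both errors apply to every condition alike, so the scorer supports relative rankings but not absolute claims.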

selection load-bearing · wrong state is toxic · auto-constructible · token-efficient on long docs · no general capability gain · metadata governance unproven · router enforces the same metrics

1. The condition ladder · recall across regimes

| regime · model | A | B | B_auto | C | F | G | R1 | R2n | E |
|---|---|---|---|---|---|---|---|---|---|
| core short · sonnet | 0.71 | 0.88 | n/r | 0.02 | 0.02 | 0.02 | 0.46 | n/r | 0.88 |
| lifecycle short · sonnet | 0.94 | 1.00 | 0.95 | n/r | n/r | n/r | 0.54 | 0.85 | 0.99 |
| lifecycle short · gpt-4o | 0.85 | 0.99 | 0.95 | n/r | n/r | n/r | 0.49 | 0.71 | 1.00 |
| lifecycle short · llama | 0.72 | 1.00 | 0.88 | n/r | n/r | n/r | 0.51 | 0.72 | 0.97 |
| lifecycle LONG ~18k · sonnet | 0.88 | 1.00 | 0.96 | n/r | n/r | n/r | 0.52 | 0.08 | 1.00 |
| lifecycle LONG ~18k · granite | 0.76 | 0.96 | 0.88 | n/r | n/r | n/r | 0.44 | 0.10 | 0.94 |

Recall (mean over reps/cases). n/r = condition not run in that regime. Conditions: A full chat · B DESi slice · B_auto auto-built DESi · C wrong-slice · F empty · G neutral-irrelevant · R1 BM25 · R2n neural retrieval · E budget-matched no-metadata. C/F/G collapse everywhere; B / B_auto stay near 1.0.
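The gap between DESi and each retriever can be recomputed from the lifecycle rows. The sketch below assumes the six values in each lifecycle row map to A, B, B_auto, R1, R2n, E, which is the reading that reproduces the gaps quoted in the retrieval section. The dictionary layout and variable names are illustrative only.

```python
# Recall per lifecycle row: (B, R1, R2n). Column mapping is an assumption.
rows = {
    "short · sonnet": (1.00, 0.54, 0.85),
    "short · gpt-4o": (0.99, 0.49, 0.71),
    "short · llama": (1.00, 0.51, 0.72),
    "LONG · sonnet": (1.00, 0.52, 0.08),
    "LONG · granite": (0.96, 0.44, 0.10),
}

# Gap of the DESi slice over lexical (R1) and neural (R2n) retrieval.
gaps = {
    name: (round(b - r1, 2), round(b - r2n, 2))
    for name, (b, r1, r2n) in rows.items()
}

for name, (vs_bm25, vs_neural) in gaps.items():
    print(f"{name:16s} B-R1 = {vs_bm25:+.2f}   B-R2n = {vs_neural:+.2f}")
```

Under that reading, the short-document neural gap spans +0.15 to +0.28 and the long-document gaps are roughly +0.5 (BM25) and +0.9 (neural).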

2. The five questions

3. DESi vs retrieval · the most nuanced result

| retriever · regime | B − retrieval (recall) | reading |
|---|---|---|
| lexical BM25 / TF-IDF · short docs | ≈ +0.70 | lexical retrieval overstated DESi's edge |
| neural (bge-small) · budget-matched · short | ≈ +0.15 … +0.28 | a good dense retriever closes most of the gap |
| neural · optimal k ≈ all chunks · short | ≈ +0.05 | at unlimited budget, retrieval ≈ DESi |
| budget-matched · LONG ~18k docs | BM25 +0.5 · neural +0.9 | retrieval collapses; DESi holds at ~2% of the tokens |

A budget/k sweep is the key: neural recall rose from 0.15 (k=1) to 0.83 (k≈13, 1× budget) to 0.95 (k≈18 ≈ all chunks). On short chats, retrieval matches DESi if given enough k, which is cheap there, so DESi's short-doc edge is mostly token efficiency, not capability. On long documents, "all chunks" doesn't fit the budget. At 1% of ~990 chunks, the small embedder is fooled by generically relevant filler: it retrieves "how is X set up" over the actual decision turns, whereas BM25's keyword match is more robust. Retrieval therefore collapses while DESi / B_auto stay near 1.0. Caveat: the neural collapse is for a small embedder + a generic query; a stronger embedder would rank needles better, so B − R1 ≈ +0.5 is the more representative long-doc gap.
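The mechanics of a k sweep can be sketched with a toy retriever. This is not the harness used for the runs: the token-overlap scorer stands in for a real retriever (neither bge-small nor BM25), and the chunks, query and function names are hypothetical. It only illustrates how generic filler that echoes the query can outrank the decision turns at small k.

```python
def overlap_score(query: str, chunk: str) -> int:
    """Toy relevance score: number of distinct query tokens found in the chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def recall_at_k(query: str, chunks: list[str], needle_ids: set[int], k: int) -> float:
    """Fraction of needle chunks that appear among the top-k ranked chunks."""
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: overlap_score(query, chunks[i]),
        reverse=True,  # sorted() is stable, so ties keep document order
    )
    return len(set(ranked[:k]) & needle_ids) / len(needle_ids)


# Hypothetical mini-document: two filler turns, two decision turns, one noise turn.
chunks = [
    "how is the database set up in general",
    "how is the cache set up for the service",
    "decision: we switch the database to postgres",
    "decision: the cache is set to redis",
    "lunch is at noon",
]
needles = {2, 3}
query = "how is the database and cache set up"

sweep = {k: recall_at_k(query, chunks, needles, k) for k in range(1, len(chunks) + 1)}
print(sweep)
```

In this toy case the filler turns take the top two ranks, so recall only recovers once k is large enough to include nearly every chunk. That is affordable on a short chat and not on a long document.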

4. Token efficiency · the clearest practical win

On the ~18k-token documents, DESi (B) scored 1.0 / 0.96 (Sonnet / Granite) on ~372 input tokens, while the full chat scored 0.88 / 0.76 on ~18342 tokens — about 49× more. DESi gives better-than-full-context recall at ~1/49 of the cost, and the auto-built state nearly matches it. This is the most robust, least-caveated finding.
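The ratio is easy to check from the reported figures. The snippet below is only a worked calculation with the numbers quoted above; the variable names are illustrative.

```python
full_chat_tokens = 18342   # condition A, long documents
desi_slice_tokens = 372    # condition B, long documents

cost_ratio = full_chat_tokens / desi_slice_tokens   # how many times more input A uses
token_share = desi_slice_tokens / full_chat_tokens  # B's share of A's input

# Recall gain of B over A on the long documents, per model.
recall_gain = {
    "sonnet": round(1.00 - 0.88, 2),
    "granite": round(0.96 - 0.76, 2),
}

print(f"A uses {cost_ratio:.1f}x the tokens of B; B uses {token_share:.1%} of A's tokens")
print(recall_gain)
```

The slice therefore uses about 2% of the full-chat input while scoring 0.12 (Sonnet) and 0.20 (Granite) higher in recall.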

5. What we conclude, and what we don't

✓ SUPPORTED

  • Correct, auto-constructible state selection is load-bearing.
  • Plausible-wrong / neutral / contradictory structure is actively toxic, sticky and confidently wrong — visible only via degeneration metrics.
  • DESi's edge over retrieval is token efficiency + long-document robustness.
  • Confidently-reporting-stale is mainly a retrieval pathology; a resolved state removes it.

✗ NOT CLAIMED

  • "DESi makes the model smarter" / "wins generally" — no general capability gain.
  • Metadata governance — B ≈ E everywhere; not established.
  • A large absolute gap vs strong retrieval — the lexical ~0.7 was an artefact.

◷ OPEN

  • A held-out extractor model for B_auto (avoid self-circularity).
  • A stronger neural retriever for the long-doc gap.
  • Harder multi-supersession chains; larger N (15–20+) to tighten CIs.

6. From measurement to enforcement · the router governance layer

The ablation measures where an LLM degenerates without or with bad state. A deterministic layer in desi_router/governance/ turns those same degeneration metrics into router gates — DESi diagnoses, the router acts. It reads a read-only projection of the Layer-9 snapshot, picks one of eight epistemic modes, optionally guards the prompt, verifies the answer after the fact, and never mutates persistent state (Layer-9's gate stays the authority). Two artefacts, one vocabulary.

| ablation measures (empirical rate) | router verifier check (enforces) | governance test |
|---|---|---|
| coherence_without_continuity = 0.80 @R2n | coherence_without_continuity (warns) | test_coherence_without_continuity_warns… |
| confidence_while_wrong = 0.60 @R2n | stale_confident_answer (blocks) | test_stale_confident_answer_with_no_state_blocks |
| invalid-claim reuse (wrong-slice phases) | invalid_claim_reuse (blocks) | test_verifier_catches_invalid_claim_reuse |
| contradiction_persistence = 0.60 @A (granite) | conflict_closure_without_evidence (blocks) | test_open_conflict_closed_without_evidence |
| bad_framing_nonrecovery | → routes to recovery_mode | test_high_poisoning_is_guarded_or_recovery |

The routing mirrors the recall table: the R2n case (no state → collapse + degeneration) is exactly where select_mode refuses a blind answer — no usable state → retrieval_mode; risky state → guarded/recovery + a required verifier. 26 tests prove the gates fire on the measured failure modes (critical checks block, coherence only warns). Crucially, this does not re-open the metadata claim: B ≈ E still stands — the layer governs behaviour around the state, not extraction quality.
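The gate logic described here can be pictured with a minimal sketch. It is not the code in desi_router/governance/. The check names and their warn/block severities come from the mapping table, and `select_mode`, `retrieval_mode` and `recovery_mode` are names used in the text. The `Snapshot` fields, the `state_mode` default, the function signatures and the `verdict` helper are all hypothetical, and only three of the eight modes are modelled.

```python
from dataclasses import dataclass

# Severities as listed in the mapping table: critical checks block, coherence warns.
CHECK_SEVERITY = {
    "coherence_without_continuity": "warn",
    "stale_confident_answer": "block",
    "invalid_claim_reuse": "block",
    "conflict_closure_without_evidence": "block",
}


@dataclass(frozen=True)  # frozen: the router only reads the projection
class Snapshot:
    """Hypothetical read-only projection of the state snapshot."""
    has_usable_state: bool
    risky: bool


def select_mode(snap: Snapshot) -> tuple[str, bool]:
    """Return (mode, verifier_required). Simplified to three modes."""
    if not snap.has_usable_state:
        return "retrieval_mode", True   # refuse a blind answer
    if snap.risky:
        return "recovery_mode", True    # risky state needs a verifier
    return "state_mode", False          # hypothetical default mode name


def verdict(fired_checks: list[str]) -> str:
    """Collapse fired verifier checks into pass / warn / block."""
    severities = {CHECK_SEVERITY[name] for name in fired_checks}
    if "block" in severities:
        return "block"
    return "warn" if "warn" in severities else "pass"
```

For example, `select_mode(Snapshot(has_usable_state=False, risky=False))` routes to `retrieval_mode`, and `verdict(["coherence_without_continuity"])` yields a warning rather than a block. Nothing in the sketch writes back to the snapshot, matching the read-only constraint.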

Methodology: real OpenRouter runs, temperature 0, fixed seed, 2–3 reps; conditions and metrics unit-tested; results store metrics only (no raw model text, no keys). Paraphrase/negation-blind evaluator → relative comparisons only; small N → directional. Full detail, per-phase PDFs and the code: ab_evidence/ · consolidated report (PDF) · governance ↔ ablation (PDF) · router governance (docs).