Skip to content

Helix — Retrieval Pipeline

Status: Draft v1 · Last updated: 2026-06-18 · Related: TSD · Memory Model · Consolidation · Cost · Decisions


1. Overview & Latency Budget

Retrieval is the hot path. Every other Helix subsystem (ingest, consolidation, graph maintenance) can be slow, async, and batched. Retrieval cannot. A coding agent calls recall() synchronously inside its reasoning loop, often several times per turn, and the latency stacks directly onto the user-visible response time.

The design therefore inverts the usual RAG cost model: all expensive work — LLM expansion, summarization, graph construction, embedding — is pushed to ingest time. The query path does pure CPU vector math, lexical lookup, and graph traversal. No LLM call ever touches the query hot path in the default tier (ADR-016).

Hard constraints (honored throughout this document):

Constraint Target Why
Latency p95 < 150 ms Synchronous agent loop; multiple recalls per turn
Compute CPU only Local-first; no GPU assumed
Cost $0 default No paid API on query path
LLM on query path None Pushed to ingest (doc2query, summaries, KG)
Scale 10⁵–10⁶ typed items Personal/project memory, not web-scale

Latency budget (p95, single recall, ~10⁶ items)

┌──────────────────────────────────────────────────────────────┐
│ TOTAL BUDGET: 150 ms (p95)                                     │
├──────────────────────────────────────────────────────────────┤
│ Stage                                  Budget   Cumulative     │
│ ─────────────────────────────────────  ──────   ──────────    │
│ 1. Query embed (int8, 384d)              8 ms      8 ms        │
│ 2. Scope / route (metadata filter)       3 ms     11 ms        │
│ 3a. Dense ANN top-100                    25 ms     36 ms       │
│ 3b. BM25 top-100 (parallel w/ 3a)       (15 ms)    36 ms       │
│ 4. RRF fuse (k=60)                        2 ms     38 ms       │
│ 5. Graph expansion (PPR / bounded BFS)   30 ms     68 ms       │
│ 6. Multi-signal ranking                   8 ms     76 ms       │
│ 7. MMR dedup / diversity                 10 ms     86 ms       │
│ 8. Token-budgeted packing                 6 ms     92 ms       │
│ ─────────────────────────────────────  ──────   ──────────    │
│ Headroom for GC / cold cache / I/O      ~58 ms    150 ms       │
└──────────────────────────────────────────────────────────────┘

The budget leaves ~40% headroom precisely because the killers — cross-encoder reranking and query-time LLM expansion — are excluded from the default tier and only available in opt-in tiers (§6, §7). A top-100 CPU cross-encoder rerank alone is 88–257 s, roughly 600–1700× the entire budget (Speed Showdown); it is structurally impossible at p95<150ms and is never default.


2. The Default Pipeline

                          query string
                               │
            ┌──────────────────▼───────────────────┐
            │  1. QUERY EMBED  (bge-small int8 384d)│
            └──────────────────┬───────────────────┘
                               │
            ┌──────────────────▼───────────────────┐
            │  2. SCOPE / ROUTE                     │
            │  embedding router → {skip? · type ·  │
            │  project · time window · ACL}         │
            └──────────────────┬───────────────────┘
                               │   (skip-retrieval → empty set)
            ┌──────────────────▼───────────────────┐
            │  3. HYBRID RETRIEVE  (parallel)       │
            │  ┌───────────┐      ┌──────────────┐  │
            │  │ DENSE ANN │      │  BM25 / sparse│ │
            │  │ top-100   │      │  top-100      │  │
            │  └─────┬─────┘      └──────┬───────┘  │
            └────────┼───────────────────┼──────────┘
                     └─────────┬─────────┘
            ┌──────────────────▼───────────────────┐
            │  4. RRF FUSE  k=60  → ranked top-100  │
            └──────────────────┬───────────────────┘
                               │  (seed set)
            ┌──────────────────▼───────────────────┐
            │  5. GRAPH EXPANSION                    │
            │  Personalized PageRank over typed KG  │
            │  (HippoRAG-style) OR bounded 1–2 hop  │
            │  → adds multi-hop neighbors           │
            └──────────────────┬───────────────────┘
                               │  (~120–160 candidates)
            ┌──────────────────▼───────────────────┐
            │  6. MULTI-SIGNAL RANKING              │
            │  sim · recency · salience · conf ·    │
            │  graph-proximity  (weights per type)  │
            └──────────────────┬───────────────────┘
                               │
            ┌──────────────────▼───────────────────┐
            │  7. MMR DEDUP / DIVERSITY  λ=0.5      │
            │  (MinHash-LSH ~0.8 pre-dedup)         │
            └──────────────────┬───────────────────┘
                               │
            ┌──────────────────▼───────────────────┐
            │  8. TOKEN-BUDGETED PACKING            │
            │  lost-in-the-middle ordering          │
            └──────────────────┬───────────────────┘
                               │
                       ranked memory bundle

Stage 1 — Query embed

The query is embedded once with the same model used at ingest (default bge-small-en-v1.5, 384d, int8; §6). int8 query embedding keeps this under ~8 ms on CPU. The full-precision query vector is retained in memory for the optional int8/binary rescore pass (§6) so we never pay accuracy loss on the asymmetric query side.

Stage 2 — Scope / route

A cheap embedding router (~10× cheaper than LLM routing: 50–200 ms LLM vs sub-10 ms here) classifies the query against precomputed centroids to decide (router cost): - skip-retrieval — Self-RAG-style; if the query needs no memory (e.g. pure arithmetic, "format this JSON"), return empty and save the whole pipeline (Self-RAG). - scope filters — memory type(s), project, time window, ACL — applied as pre-filters on the ANN index so dense/BM25 only search the relevant partition.

Stage 3 — Hybrid retrieve (dense + BM25), top-100 each

Dense ANN and BM25 run in parallel. Both are mandatory, not optional: - Dense captures semantic paraphrase. DPR beats BM25 78.4% vs 59.1% top-20 on NQ (DPR). - BM25 is non-negotiable for personal/code memory. Dense retrievers degrade out-of-domain while BM25 generalizes (BEIR), and personal memory is saturated with proper nouns, file paths, identifiers, and verbatim error strings that demand exact lexical match. SPLADE is an optional sparse-semantic upgrade (SPLADE).

Stage 4 — RRF fuse, k=60

Fuse the two ranked lists with Reciprocal Rank Fusion:

score(d) = Σ  1 / (k + rank_i(d))         k = 60
          i∈{dense, bm25}

RRF uses ranks only, sidestepping the fragility of normalizing two incomparable score distributions (RRF explained). k=60 is the original Cormack SIGIR 2009 value (Cormack RRF) and matches Elasticsearch's rank_constant=60. Implementation note: Qdrant uses a zero-based k=2 convention — always validate the rank-base per engine before assuming 60 (ADR-018).

Stage 5 — Graph expansion

The RRF top-100 becomes the seed set for a single graph step over Helix's typed knowledge graph (built entirely at ingest). Two modes: - Personalized PageRank (default, HippoRAG-style) — run PPR with the seed set as the personalization vector; this performs single-step multi-hop association in one classic, cheap CPU algorithm (HippoRAG). HippoRAG2 lifts 2Wiki Recall@5 from 76.5→90.4 (HippoRAG2). - Bounded traversal (fallback) — for very large graphs or tight budgets, a 1–2 hop BFS capped at N neighbors per seed.

We explicitly reject Microsoft GraphRAG for the default path: its per-chunk + community-summary LLM indexing is ruinously expensive and global search issues ~40K-token prompts (GraphRAG); even LazyGraphRAG only cuts indexing cost (LazyGraphRAG). The whole point of HippoRAG-PPR is that all LLM cost lives at ingest, and the query is pure linear algebra. LightRAG (<100 tokens/query, incremental insert; LightRAG) and Graphiti (bi-temporal invalidation, ~300 ms P95; Graphiti) inform the ingest-side graph design — see Memory Model.

Stage 6 — Multi-signal ranking

The fused + expanded candidate pool (~120–160 items) is scored by the blended formula in §3.

Stage 7 — MMR dedup / diversity

Maximal Marginal Relevance removes near-duplicates and diversifies (§8).

Stage 8 — Token-budgeted packing

Pack to the caller's token budget using lost-in-the-middle ordering (§8).


3. The Ranking Formula

Each signal is min-max normalized within the candidate pool to [0,1], then linearly combined. Normalize-then-blend is the standard, robust approach used by Mem0, which fuses vector + BM25 + entity signals after normalization (Mem0).

score(d, q) =  w_sim   · sim(d, q)            // cosine, normalized
             + w_rec   · recency(d)           // exp decay from LAST ACCESS
             + w_sal   · salience(d)           // importance, set at ingest
             + w_conf  · confidence(d)         // source/verification trust
             + w_graph · graphprox(d, q)       // PPR mass / inverse hop dist
Signal Source Notes
sim cosine(query, doc) The relevance backbone. From RRF-fused rank → score.
recency exp(-λ · Δt) Decays from last access, not creation (LangChain TimeWeighted). Generative Agents use 0.995/hr (Generative Agents).
salience importance score, ingest-time Generative Agents use an LLM 1–10 importance; Helix computes it at ingest so the query path stays LLM-free.
confidence source trust / verification E.g. a user-confirmed decision > a speculative inference.
graphprox PPR mass or 1/hop-distance Rewards items pulled in by Stage 5 association.

Weights are tunable per memory type. The generative-agents baseline sets recency = importance = relevance = 1 (Generative Agents); Helix generalizes this so each type can rebalance:

Memory type sim recency salience confidence graph Rationale
Episodic (events, sessions) 0.35 0.30 0.15 0.10 0.10 Time matters most
Semantic (facts, entities) 0.40 0.05 0.20 0.20 0.15 Recency nearly irrelevant; trust matters
Procedural (how-tos, runbooks) 0.45 0.10 0.25 0.15 0.05 Relevance + importance dominate
Decision/ADR 0.35 0.10 0.25 0.25 0.05 Confidence weighted heavily
Code/path refs 0.50 0.15 0.10 0.10 0.15 Exact relevance + graph links

Defaults ship as config; types and weights are overridable. See Memory Model for type definitions.


4. Embeddings (ADR-017)

All embeddings computed at ingest and stored quantized. The query path embeds once and compares against the quantized index, with optional full-precision rescore on a small top-k.

Default tier ($0, CPU, fast)

bge-small-en-v1.5 — 384d, 33M params, MTEB 62.17 — the best-in-class small model and the Helix default (bge-small). Stored int8 (4× memory reduction, ~99.3% quality retained with rescore; quantization).

Model dim params MTEB Tier
all-MiniLM-L6-v2 384 22.7M ~56 (legacy/tiny)
bge-small-en-v1.5 384 33M 62.17 DEFAULT
gte-small 384 61.36 alt small
e5-small 384 59.93 alt small
nomic-embed-text-v1.5 768 62.28 MRL, 8K ctx
bge-base-en-v1.5 768 63.55 mid
mxbai-embed-large-v1 1024 64.68 upgrade
arctic-embed-l-v2.0 1024 (multilingual) upgrade

Upgrade tier (higher quality, still CPU-feasible)

mxbai-embed-large-v1 (1024d, MTEB 64.68; mxbai) or arctic-embed-l-v2.0 (1024d, multilingual; Arctic 2), both Matryoshka-trained (MRL). Use MRL truncation to claw back the cost of the larger model: - Matryoshka truncation — arctic-m@256d retains ~99% quality; nomic@256d loses only 1.24 MTEB (Arctic MRL, nomic MRL). Store a truncated prefix for the ANN scan, full dim for rescore. - Binary quantization — 32× memory reduction and up to 32× faster search, ~96% quality retained with a rescore pass (quantization). This is what makes a 1024d model viable at 10⁶ items on CPU: binary ANN scan → int8/float rescore of top-k.

Rule of thumb: binary for the first-pass scan, int8/full for rescore. Never compare quantized vectors without a rescore stage on the survivors.


5. Optional High-Quality Rerank Tier

Reranking is off by default because a naïve top-100 CPU cross-encoder is 88 s (bge-base) to 257 s (bge-v2-m3), 65–195× slower than GPU (Speed Showdown) — impossible at 150 ms. Two escapes make reranking feasible on CPU when the user opts in:

Approach Mechanism CPU feasibility Quality
answerai-colbert-small-v1 Late interaction (token-level), 33M Marketed for ms-scale CPU search over 100Ks of docs (ColBERT-small); PLAID = 45× faster ColBERT on CPU (PLAID) BEIR 53.79
int8 ONNX cross-encoder, top-20–30 only Quantized cross-encoder over a tiny survivor set Feasible only if k ≤ 20–30 (CE efficiency) High

For reference, full cross-encoders score BEIR 53.94 (bge-reranker-v2-m3, 0.6B; bge-rerank) and 57.49 (mxbai-rerank-large-v2; mxbai-rerank) — excellent but GPU/cloud territory.

Helix rule: the only CPU-default-tier rerank option is ColBERT late interaction. The int8 ONNX cross-encoder is allowed only on the top-20–30, and any heavier reranker is pushed to the async/cloud tier (see Cost).


6. Query Understanding

The dividing line is index-time vs query-time. Index-time expansion is latency-safe and $0-on-query; query-time LLM expansion violates the no-LLM-on-hot-path constraint.

Technique When Effect Verdict
doc2query / docTTTTTquery Ingest Recall@1000 85.3→89.3 at ~0 query cost (docTTTTTquery) DEFAULT — generate expansion queries at ingest, index them
Embedding router Query ~10× cheaper than LLM routing (sub-10 ms vs 500–2000 ms) DEFAULT — Stage 2 routing
Skip-retrieval Query Self-RAG: skip when no memory needed (Self-RAG) DEFAULT — Stage 2
HyDE Query TREC DL19 61.3 vs 44.5, but loses to in-domain fine-tuning and adds +25–60% latency (HyDE) async/cloud tier only
Query2doc Query +15 nDCG but >2000 ms/query (Query2doc) async/cloud tier only
Multi-query Query Multiple LLM rewrites async/cloud tier only

Net: Helix gets the recall benefit of expansion via doc2query at ingest and the routing benefit via a cheap embedding router — both compatible with the budget. HyDE/Query2doc/multi-query are real wins but their latency (and LLM cost) confines them to the opt-in async/cloud tier described in Cost.


7. "Lost in the Middle" Packing

LLM accuracy on a relevant item drops from 75.8% → 53.8% when that item sits in the middle of a long context (Lost in the Middle). Packing order is therefore a first-class ranking concern, not an afterthought.

  context window (token budget)
  ┌──────────────────────────────────────────────────────┐
  │ rank 1  │ rank 3 │ rank 5 │ … │ rank 6 │ rank 4 │ rank 2│
  │ (BEST)  │        │        │   │        │        │(2nd)  │
  └──────────────────────────────────────────────────────┘
     START  ◄─── most salient at BOTH ends ───►   END
                   (weakest items buried mid)

Pre-packing dedup & diversity: 1. MinHash-LSH near-dup removal at Jaccard threshold ~0.8 before any semantic merge. 2. MMR for diversity (MMR):

MMR = argmax [ λ · sim(d, q) − (1−λ) · max sim(d, dₛₑₗ) ]      λ = 0.5
       d∉S                          dₛₑₗ∈S
  1. Selective, budget-aware packing. Don't pack everything. Mem0's selective approach yields +26% on LoCoMo, 91% lower p95, and ~90% fewer tokens (Mem0). Fewer, better, well-ordered memories beat a stuffed context — this also directly serves Cost.

Packing algorithm: sort survivors by final score → place rank 1 at start, rank 2 at end, rank 3 next-to-start, rank 4 next-to-end, … (outside-in interleave) until the token budget is exhausted.


8. Twelve Opinionated Decisions

# Decision Why ADR
1 No LLM on the query hot path, ever (default tier) Only way to hit p95<150ms at $0 ADR-016
2 Hybrid dense + BM25 is mandatory, not optional Personal/code memory is full of paths, IDs, error strings; BM25 generalizes where dense degrades (BEIR) ADR-016
3 RRF fusion, k=60, ranks-only Avoids score-normalization fragility (RRF); validate rank-base per engine ADR-016
4 HippoRAG Personalized PageRank for graph hop Multi-hop in one cheap CPU pass; all LLM cost at ingest (HippoRAG) ADR-016
5 Reject Microsoft GraphRAG for default Ruinous LLM indexing + 40K-token global queries (GraphRAG) ADR-016
6 Cross-encoder reranking OFF by default 88–257 s on CPU for top-100 (Speed Showdown) ADR-016
7 If rerank, use ColBERT late-interaction (CPU-ms) or int8 ONNX CE on top-20–30 only Only CPU-feasible rerank paths (ColBERT-small, CE efficiency) ADR-016
8 bge-small-en-v1.5 384d int8 as default embedder Best small MTEB (62.17); int8 = 4× mem, ~99.3% w/ rescore (bge-small, quant) ADR-017
9 Upgrade tier uses MRL truncation + binary quantization 32× mem, up to 32× faster, ~96% w/ rescore; makes 1024d viable at 10⁶ (quant) ADR-017
10 Min-max normalize each signal, then linear blend; weights per type Robust fusion (Mem0); per-type recency/confidence balance (Mem0) ADR-016
11 Recency decays from LAST ACCESS Reinforces re-used memories (LangChain) ADR-016
12 Expansion at ingest (doc2query); HyDE/multi-query async-only doc2query +Recall at ~0 query cost; HyDE/Q2D cost >2000 ms (docTTTTTquery, Query2doc) ADR-016

9. Failure Modes

Failure mode Symptom Mitigation
Dense-only blind spot Misses exact path/ID/error-string queries BM25 leg is mandatory (decision #2); never disable sparse
Quantization drift Binary/int8 ANN returns wrong neighbors Always rescore top-k with int8/full vectors before ranking
RRF rank-base bug Silent quality loss from wrong k convention Pin and test k; validate zero- vs one-based per engine (Qdrant k=2)
Graph over-expansion PPR floods pool with weak neighbors, dilutes precision Cap PPR mass / hop count; graphprox weight is small for semantic types
Cold ANN cache First query of a session blows the budget Warm index on session start; budget has ~58 ms headroom
Lost-in-the-middle Relevant memory present but ignored by LLM Outside-in packing; most-salient at both ends (Lost in the Middle)
Over-packing High tokens, high latency, lower accuracy Selective packing; cap memory count (Mem0)
Recency runaway Stale-but-important facts buried by fresh noise Per-type weights: low recency weight for semantic/decision types
Skip-retrieval false negative Router skips when memory was needed Conservative router threshold; cheap to retrieve, costly to miss
Near-dup spam Same fact repeated N times wastes budget MinHash-LSH ~0.8 pre-dedup + MMR diversity
Reranker accidentally enabled on CPU p95 explodes to tens of seconds Default off; guard rail caps CE input to top-20–30
Out-of-domain dense collapse Dense recall tanks on novel jargon BM25 floor + ingest-time doc2query expansion

Sources

  • DPR — Dense Passage Retrieval: https://arxiv.org/abs/2004.04906
  • BEIR — heterogeneous IR benchmark: https://arxiv.org/abs/2104.08663
  • SPLADE / sparse-semantic encoders: https://opensearch.org/blog/improving-document-retrieval-with-sparse-semantic-encoders/
  • Cormack et al., Reciprocal Rank Fusion (SIGIR 2009): https://cormack.uwaterloo.ca/cormacksigir09-rrf.pdf
  • RRF — how it works / when to use: https://bigdataboutique.com/blog/reciprocal-rank-fusion-how-it-works-and-when-to-use-it
  • Reranker speed showdown (CPU vs GPU): https://medium.com/@xiweizhou/speed-showdown-reranker-1f7987400077
  • answer.ai — small-but-mighty ColBERT: https://www.answer.ai/posts/2024-08-13-small-but-mighty-colbert.html
  • PLAID — efficient late interaction: https://arxiv.org/abs/2205.09707
  • SBERT cross-encoder efficiency: https://sbert.net/docs/cross_encoder/usage/efficiency.html
  • bge-reranker-v2-m3: https://huggingface.co/BAAI/bge-reranker-v2-m3
  • mxbai-rerank-v2: https://www.mixedbread.com/blog/mxbai-rerank-v2
  • Microsoft GraphRAG: https://arxiv.org/abs/2404.16130
  • LazyGraphRAG: https://www.microsoft.com/en-us/research/blog/lazygraphrag-setting-a-new-standard-for-quality-and-cost/
  • HippoRAG: https://arxiv.org/abs/2405.14831
  • HippoRAG2: https://arxiv.org/abs/2502.14802
  • LightRAG: https://arxiv.org/abs/2410.05779
  • Graphiti: https://arxiv.org/abs/2501.13956
  • HyDE: https://arxiv.org/abs/2212.10496
  • Query2doc: https://arxiv.org/abs/2303.07678
  • docTTTTTquery / doc2query: https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery-v2.pdf
  • Self-RAG: https://arxiv.org/abs/2310.11511
  • bge-small-en-v1.5: https://huggingface.co/BAAI/bge-small-en-v1.5
  • nomic-embed Matryoshka: https://www.nomic.ai/news/nomic-embed-matryoshka
  • mxbai-embed-large-v1: https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1
  • Snowflake Arctic Embed 2.0: https://www.snowflake.com/en/engineering-blog/snowflake-arctic-embed-2-multilingual/
  • Arctic Embed (MRL): https://github.com/Snowflake-Labs/arctic-embed
  • Embedding quantization (int8/binary): https://huggingface.co/blog/embedding-quantization
  • Generative Agents (memory stream): https://ar5iv.labs.arxiv.org/html/2304.03442
  • LangChain TimeWeightedVectorStore: https://python.langchain.com/v0.2/docs/how_to/time_weighted_vectorstore/
  • Mem0: https://arxiv.org/abs/2504.19413 · https://arxiv.org/html/2504.19413v1
  • Lost in the Middle: https://arxiv.org/abs/2307.03172
  • MMR (Carbonell & Goldstein): https://dl.acm.org/doi/10.1145/290941.291025