Helix — Evaluation & Benchmarks

Status: Draft v1 · Last updated: 2026-06-18 · Related: Retrieval · Consolidation · PRD · Decisions

Helix is a local-first, coding-agent-first, portable, $0-default AI memory layer. A memory layer is only worth shipping if we can prove it makes an agent measurably better — more correct, faster, cheaper in tokens — without quietly poisoning itself. This document specifies how we measure memory quality: which external benchmarks we trust (and which we don't), the benchmark gap we intend to fill for coding agents, and the internal eval harness that gates every release.

Per ADR-027 (evaluation strategy), Helix standardizes on LongMemEval over LoCoMo as the primary external instrument, and commits to defining and publishing a coding-agent memory benchmark that does not currently exist. The rationale follows.

1. The benchmark landscape

There is no single benchmark that measures what Helix actually does. The conversational-memory benchmarks measure some of it (extraction, multi-session reasoning, knowledge updates), the coding benchmarks measure end-task correctness but assume zero cross-task memory, and vendor leaderboards are an actively contested mess. We triage them here.

1.1 LoCoMo — the most-cited, most-criticized

LoCoMo (Long-term Conversational Memory) is the de-facto reference benchmark: 1540 questions across four categories — single-hop, multi-hop, open-domain, and temporal — over long synthetic multi-session dialogues. It is the number everyone quotes.

It is also the number you should stop quoting. The Zep teardown (blog.getzep.com) documents structural flaws that make LoCoMo a weak discriminator of memory quality:

Documented flaw	Why it breaks the benchmark
Conversations are tiny	Individual conversations are only ~16K–26K tokens — well within a modern model's context window. A "long-term memory" benchmark that fits in context isn't testing memory.
A full-context baseline wins	Dumping the entire conversation into the prompt scores ~73% (J / LLM-judge) — beating specialized memory systems. If naïve full-context beats your memory layer, the benchmark isn't rewarding memory.
No knowledge-update questions	LoCoMo never asks "what is the current value of a fact that changed?" — so it cannot measure contradiction handling or staleness, the hardest and most valuable part of real memory.
Data-quality errors	Mislabeled and ambiguous gold answers inflate noise and let scoring methodology swing results by 20+ points.

The category we care about most — knowledge updates — is simply absent. We use LoCoMo only as a legacy sanity check, never as a quality signal.

1.2 Vendor LoCoMo numbers are contested — do not trust them

The published LoCoMo leaderboard is a vendor brawl, not a measurement. A non-exhaustive timeline:

Claim	Source
Mem0 ~66.9% LLM-judge, 91% lower p95 latency, ~90% token savings (~1.8K vs 26K tokens/conversation)	arxiv.org/abs/2504.19413, mem0.ai/research
Zep claims 75.14% (corrected methodology) vs Mem0's reported 66%	blog.getzep.com
Mem0 counters: Zep's own number drops 84% → 58.44% under corrected scoring	github.com/getzep/zep-papers/issues/5

The lesson is not "who is right". It is that the same benchmark produces wildly different numbers depending on who runs the scoring, which is the signature of a benchmark that doesn't constrain its evaluation methodology. Helix policy: we do not cite vendor LoCoMo scores as evidence of anything, including our own. The one number worth retaining is the ~1.8K vs 26K token yardstick, not as a leaderboard rank but as a target: memory must save tokens versus full-context (see §3.6).

1.3 LongMemEval — the better instrument

LongMemEval (ICLR 2025) is what LoCoMo should have been: 500 human-curated questions over realistic long histories, with LongMemEval_S at ~115K tokens and LongMemEval_M up to ~1.5M tokens (arxiv.org/abs/2410.10813). At _S scale, full-context prompting is expensive on every query; at _M scale, the history exceeds most context windows, so pasting it stops being a shortcut. Crucially, it measures the five capabilities that map directly onto Helix's value proposition:

Capability	What it tests	LoCoMo?
Information extraction	Pull a specific fact stated once across a long history	partial
Multi-session reasoning	Synthesize facts spread across many sessions	partial
Temporal reasoning	Reason about when things happened / ordering	weak
Knowledge updates	Return the current value of a fact that changed; detect the contradiction	absent
Abstention	Refuse to answer when the information is genuinely absent	absent

The last two — knowledge updates and abstention — are exactly the capabilities LoCoMo ignores and exactly the failures that make a memory layer dangerous in production (stale facts, confident hallucination). LongMemEval is therefore our primary external benchmark per ADR-027.

1.4 MemBench & BEAM — operation coverage

LongMemEval is QA-shaped; it does not exercise the full operation surface of a memory system (write, update, delete, conflict resolution, forgetting). MemBench and BEAM extend coverage toward explicit memory operations (mem0.ai/blog/ai-memory-benchmarks-in-2026). We track these as secondary instruments to stress the edit/forget and contradiction paths that QA accuracy alone hides (the same paths our internal harness targets in §3).

2. The coding-agent memory eval GAP

Here is the whitespace. No mature memory benchmark exists for coding agents. Every serious coding benchmark — SWE-bench, RepoBench, LongCodeBench, SWE Context Bench, SWE-EVO, SWE-Bench-CL — evaluates each task as an independent episode with no memory carried between tasks. The agent solves issue N, the harness resets, and issue N+1 starts from a blank slate. The literature is blunt about it: "every coding benchmark treats tasks as independent episodes with no memory between them" (arxiv.org/html/2602.08316v3, arxiv.org/pdf/2507.00014, arxiv.org/pdf/2512.13564).

That stateless framing is precisely the thing a coding-agent memory layer is supposed to fix. A benchmark that resets between tasks cannot, by construction, measure the value of memory. So the headline question Helix exists to answer is unmeasured by anyone:

Does remembering project conventions, decisions, and prior mistakes across sessions reduce repeat mistakes and tokens on the next task?

2.1 Helix's whitespace: define & publish the benchmark

Helix will define and publish a coding-agent memory benchmark (working name: HelixCodeMem) that measures cross-task carryover rather than single-task solve rate. Design principles:

Task sequences, not task sets. Episodes are ordered within a repo so that earlier tasks establish conventions, decisions, and corrections that later tasks can exploit (or repeat-fail).
Memory-on vs memory-off A/B. Every sequence is run with Helix enabled and with a no-memory control. The benchmark's score is the delta, not the absolute.
Repeat-mistake rate. Did the agent re-make a mistake the human/agent already corrected in an earlier session (e.g., wrong import style, deprecated API, ignored ADR)?
Convention adherence. After a convention is established once (naming, error-handling pattern, test layout), is it followed in later tasks without being re-specified in the prompt?
Decision recall. Are prior architectural decisions (the project's own ADRs) respected instead of relitigated?
Token cost of carryover. Memory must reduce total tokens-to-solve across the sequence versus re-deriving context every task (ties to the tokens-per-retrieval gate in §3.6).

The scoring axes mirror LongMemEval where they transfer (extraction, temporal/decision ordering, knowledge updates when a convention changes mid-sequence, abstention when a "remembered" convention does not actually apply). Publishing this is both an evaluation asset and a positioning asset: it is the eval that does not exist.
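As a rough illustration of the memory-on vs memory-off scoring described above, the sketch below aggregates one arm of a task sequence and reports the delta between arms. HelixCodeMem is not yet defined, so the `TaskRun` fields, the per-task "known corrections" denominator for repeat-mistake rate, and the function names are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TaskRun:
    """One task in an ordered sequence (all field names are hypothetical)."""
    task_id: str
    solved: bool
    tokens: int
    repeat_mistakes: int     # mistakes corrected in an earlier task and made again here
    known_corrections: int   # corrections already established before this task


def summarize(runs: List[TaskRun]) -> Dict[str, float]:
    """Aggregate one arm (memory-on or memory-off) of a sequence."""
    if not runs:
        raise ValueError("empty sequence")
    opportunities = sum(r.known_corrections for r in runs)
    return {
        "solve_rate": sum(r.solved for r in runs) / len(runs),
        "total_tokens": float(sum(r.tokens for r in runs)),
        "repeat_mistake_rate": (
            sum(r.repeat_mistakes for r in runs) / opportunities if opportunities else 0.0
        ),
    }


def carryover_delta(memory_on: List[TaskRun], memory_off: List[TaskRun]) -> Dict[str, float]:
    """Score is memory-on minus memory-off, per metric."""
    on, off = summarize(memory_on), summarize(memory_off)
    return {name: on[name] - off[name] for name in on}
```

Under this reading, a useful memory layer shows a positive solve-rate delta and negative deltas for total tokens and repeat-mistake rate.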

3. The internal eval harness

External benchmarks gate positioning; the internal harness gates releases. It runs in CI (§4) and produces the results table (§5). It has two layers — retrieval quality (did we surface the right memories?) and end-task quality (did surfacing them make the agent right?) — plus the hard-mode suites that distinguish a memory layer from a glorified cache.

3.1 Retrieval quality of surfaced memories

Against labeled (query, gold-memories) sets, we score the ranked list of memories Helix surfaces:

Metric	Measures	Why it matters
precision@k	fraction of surfaced memories that are relevant	poisoning the prompt with junk costs tokens and accuracy
recall@k	fraction of gold memories surfaced in top-k	a missed memory = a repeated mistake
MRR	mean over queries of the reciprocal rank of the first relevant memory	agents read top-down; position matters
nDCG	graded relevance, position-discounted	the realistic "some memories matter more" case
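The four metrics in the table can be sketched with standard textbook definitions. This is a minimal reference, not the harness's actual code: function names are illustrative, memory IDs are assumed unique within a ranked list, and precision is taken over the memories actually surfaced (up to k), matching the table's wording.

```python
import math
from typing import Mapping, Sequence, Set


def precision_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of the surfaced top-k memories that are relevant."""
    top = ranked[:k]
    return sum(1 for m in top if m in gold) / len(top) if top else 0.0


def recall_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of gold memories that appear in the top-k."""
    return sum(1 for m in ranked[:k] if m in gold) / len(gold) if gold else 0.0


def reciprocal_rank(ranked: Sequence[str], gold: Set[str]) -> float:
    """1/rank of the first relevant memory; average this over queries to get MRR."""
    for rank, m in enumerate(ranked, start=1):
        if m in gold:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], gains: Mapping[str, float], k: int) -> float:
    """Position-discounted graded relevance, normalized by the ideal ordering."""
    dcg = sum(gains.get(m, 0.0) / math.log2(i + 1)
              for i, m in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

For example, with `ranked = ["m3", "m1", "m7", "m2"]` and `gold = {"m1", "m2"}`, precision@3 is 1/3, recall@3 is 1/2, and the reciprocal rank is 1/2.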

3.2 End-task correctness (LLM-as-judge + human audit)

Retrieval metrics are necessary but not sufficient — the only thing that ultimately matters is whether the downstream task came out right. We score end-task correctness with LLM-as-judge, explicitly acknowledging that judges are noisy: every judged run is calibrated against a human-audited subset, and we report judge–human agreement alongside the score. If judge/human agreement drops, the judge prompt is suspect, not the system under test.
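One way to report judge–human agreement on the audited subset is raw agreement alongside Cohen's kappa, which discounts agreement expected by chance. The document does not say which statistic Helix uses, so kappa here is an assumption, as are the binary pass/fail labels and the function name.

```python
from typing import List, Tuple


def agreement_and_kappa(judge: List[bool], human: List[bool]) -> Tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for binary pass/fail verdicts."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge = sum(judge) / n   # judge's pass rate
    p_human = sum(human) / n   # human's pass rate
    # Chance agreement: both say pass, or both say fail, independently.
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        return observed, 0.0   # kappa is undefined here; reported as 0.0 by convention of this sketch
    return observed, (observed - expected) / (1 - expected)
```

Raw agreement alone can look healthy when almost every run passes; kappa makes that case visible, which is useful when deciding whether the judge prompt is suspect.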

3.3 Contradiction & knowledge-update handling

We seed timelines where a fact changes over time (e.g., "the project switched from REST to gRPC", "the lint rule was relaxed"). Scoring checks two things: (a) does retrieval return the current value, not a stale one, and (b) does the system detect and surface the contradiction rather than silently serving both. This is the LongMemEval "knowledge updates" capability, run on Helix's own store.
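The two checks can be sketched against a toy timeline. The `Fact` record, the "latest session wins" rule for the current value, and the idea that the system reports a set of flagged keys are assumptions for illustration, not Helix's actual data model.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Fact:
    session: int   # session in which the value was stated
    key: str
    value: str


def current_values(timeline: List[Fact]) -> Dict[str, str]:
    """Gold 'current' value per key: the value from the latest session."""
    latest: Dict[str, str] = {}
    for fact in sorted(timeline, key=lambda f: f.session):
        latest[fact.key] = fact.value
    return latest


def changed_keys(timeline: List[Fact]) -> Set[str]:
    """Keys that took more than one distinct value, i.e. real contradictions."""
    seen: Dict[str, Set[str]] = {}
    for fact in timeline:
        seen.setdefault(fact.key, set()).add(fact.value)
    return {key for key, values in seen.items() if len(values) > 1}


def score_updates(timeline: List[Fact], answers: Dict[str, str],
                  flagged: Set[str]) -> Tuple[float, float]:
    """Return (current-value accuracy, contradiction-detection rate)."""
    gold = current_values(timeline)
    accuracy = sum(answers.get(k) == v for k, v in gold.items()) / len(gold)
    changed = changed_keys(timeline)
    detection = len(flagged & changed) / len(changed) if changed else 1.0
    return accuracy, detection
```

A system that still answers "REST" after the switch to gRPC loses accuracy on check (a); one that answers correctly but never flags the change loses on check (b).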

3.4 Edit/forget cascade correctness (incl. derived embeddings)

A delete is only real if everything derived from the record also disappears. The forget suite issues an erase and then verifies, via the provenance cascade (see Consolidation), that the source record and its derived embeddings, summaries, and consolidated abstractions are gone. We measure:

Residual leakage: can any derived artifact still surface the forgotten fact? (Target: zero.)
Cascade latency: wall-clock to fully propagate the deletion.

This is the suite that catches the classic memory-system bug: the row is deleted but the embedding/summary still answers the query.
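A minimal sketch of the cascade check, assuming a toy in-memory store with explicit "derived from" edges. `ProvenanceStore` and its methods are hypothetical stand-ins, not Helix's schema or API, and the leakage probe is a plain substring match where a real suite would query retrieval.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


class ProvenanceStore:
    """Toy store: payloads plus provenance edges (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.payloads: Dict[str, str] = {}
        self.parents: Dict[str, Set[str]] = {}

    def add(self, art_id: str, payload: str, parents: Iterable[str] = ()) -> None:
        self.payloads[art_id] = payload
        self.parents[art_id] = set(parents)

    def forget(self, root_id: str) -> Set[str]:
        """Erase root_id and everything transitively derived from it."""
        doomed: Set[str] = {root_id}
        queue = deque([root_id])
        while queue:
            current = queue.popleft()
            for art_id, parents in self.parents.items():
                if current in parents and art_id not in doomed:
                    doomed.add(art_id)
                    queue.append(art_id)
        for art_id in doomed:
            self.payloads.pop(art_id, None)
            self.parents.pop(art_id, None)
        return doomed

    def residual_leakage(self, probe: str) -> List[str]:
        """IDs of surviving artifacts whose payload still contains the probe text."""
        return sorted(a for a, text in self.payloads.items() if probe in text)
```

This sketch deletes any artifact with at least one forgotten ancestor, which is the conservative choice; an artifact derived from several sources could instead be regenerated from the survivors. Wrapping `forget` in `time.perf_counter()` calls would give the cascade-latency figure.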

3.5 Abstention / false-positive recall

This suite ties directly to poisoning resistance (§3.7). We measure false-positive recall: the fraction of queries whose answer is genuinely not in memory on which Helix surfaces a plausible-but-wrong memory instead of abstaining. High false-positive recall is how a memory layer becomes a hallucination amplifier.
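A small sketch of how false-positive recall could be computed, assuming the system abstains whenever its best retrieval score falls below a threshold. The threshold mechanism, the `Probe` record, and the function name are assumptions for this example; the document does not specify how Helix decides to abstain.

```python
from typing import List, NamedTuple


class Probe(NamedTuple):
    answerable: bool   # is the gold answer actually present in memory?
    top_score: float   # best retrieval score returned for the query


def false_positive_recall(probes: List[Probe], threshold: float) -> float:
    """Share of unanswerable probes where a memory was surfaced anyway."""
    unanswerable = [p for p in probes if not p.answerable]
    if not unanswerable:
        raise ValueError("need at least one unanswerable probe")
    surfaced = sum(1 for p in unanswerable if p.top_score >= threshold)
    return surfaced / len(unanswerable)
```

Sweeping the threshold trades false-positive recall against recall@k on answerable queries, so the two should be read together rather than tuned in isolation.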

3.6 Latency & tokens-per-retrieval (must BEAT full-context)

Metric	Target
Retrieval latency p50 / p95	tracked per release; regressions gate
Tokens-per-retrieval	must be strictly less than the full-context baseline

The token budget is non-negotiable and is the one number we inherit from the LoCoMo wars: Mem0's ~1.8K tokens vs ~26K full-context is the yardstick. If Helix surfaces memory that costs more tokens than pasting the full context would, the memory layer is a net negative. "Memory should SAVE tokens" is an acceptance criterion, not an aspiration.
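The acceptance criterion reduces to a strict inequality plus a savings ratio. The sketch below uses hypothetical function names and takes token counts as given; how tokens are counted (which tokenizer, what is included in "full context") is left open here.

```python
def token_savings(memory_tokens: int, full_context_tokens: int) -> float:
    """Fraction of tokens saved relative to the full-context baseline."""
    if full_context_tokens <= 0:
        raise ValueError("full-context baseline must be positive")
    return 1.0 - memory_tokens / full_context_tokens


def passes_token_gate(memory_tokens: int, full_context_tokens: int) -> bool:
    """Gate from the table above: strictly fewer tokens than full-context."""
    return memory_tokens < full_context_tokens
```

Plugging in the rounded yardstick figures, `token_savings(1800, 26000)` is about 0.93, in line with the roughly 90% savings cited in §1.2.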

3.7 Adversarial poisoning suite (security)

Memory is a write-back attack surface: anything ingested can resurface later, with authority, in a future session. The poisoning suite ingests a labeled set of adversarial / poisoned inputs (prompt-injection-laden commits, malicious "conventions", fake decisions) and measures:

Reach: how many poisoned items pass ingestion filtering into durable memory?
Fire rate: of those, how many actually surface later and influence a downstream task?

Both are tracked over time; a regression here is a security regression, not a quality nit.
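Reach and fire rate can be computed from per-item outcomes once each poisoned input has been traced through ingestion and a later task. The `PoisonOutcome` record and the choice to express both as fractions (reach over all poisoned items, fire over the stored ones) are assumptions for this sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PoisonOutcome:
    item_id: str
    stored: bool   # passed ingestion filtering into durable memory
    fired: bool    # later surfaced and influenced a downstream task


def reach_and_fire(outcomes: List[PoisonOutcome]) -> Tuple[float, float]:
    """Return (reach, fire rate) for a labeled set of poisoned inputs."""
    if not outcomes:
        raise ValueError("empty poisoning set")
    stored = [o for o in outcomes if o.stored]
    reach = len(stored) / len(outcomes)
    fire_rate = sum(o.fired for o in stored) / len(stored) if stored else 0.0
    return reach, fire_rate
```

Keeping the two numbers separate shows where a defence failed: high reach points at ingestion filtering, while high fire rate with low reach points at retrieval-time trust.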

4. Methodology

The harness is only trustworthy if its inputs are labeled, its timelines are deterministic, and it runs automatically on every change — including in the configuration most users will actually run.

Labeled (query, gold-memories) sets. Retrieval metrics (§3.1) require human-labeled gold sets per scenario. These are versioned alongside the code so a metric movement is attributable to a code change, not a silent dataset change.
Seeded, changing-fact timelines. Knowledge-update and contradiction tests (§3.3) use deterministic seeds so that "the fact changed at session 4" is reproducible across runs and machines.
Regression gating in CI. Each suite has thresholds; a release that regresses precision@k/recall@k, end-task correctness, p95 latency, tokens-per-retrieval, forget-cascade leakage, abstention, or poisoning reach/fire fails the build. Benchmarks that don't gate CI rot.
The $0 / offline config is tested as first-class. Helix's default is local-first, $0, fully offline (local embeddings + local store). That configuration — not a hosted/premium variant — is the one the harness runs against by default. We never let the free/offline path silently degrade because the "real" tests ran against a paid backend.
Judge calibration. LLM-as-judge runs always carry a human-audited subset and report agreement (§3.2).
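The regression-gating rule above can be sketched as a comparison of a candidate release against a stored baseline. The metric names, directions, and tolerances in `GATES` are placeholders chosen for illustration, not Helix's real thresholds.

```python
from typing import Dict, List, Tuple

# metric -> (direction that counts as better, tolerated worsening); placeholder values
GATES: Dict[str, Tuple[str, float]] = {
    "recall_at_k": ("higher", 0.01),
    "end_task_correctness": ("higher", 0.01),
    "p95_latency_ms": ("lower", 5.0),
    "tokens_per_retrieval": ("lower", 0.0),
    "forget_residual_leakage": ("lower", 0.0),
    "poison_reach": ("lower", 0.0),
}


def regressions(baseline: Dict[str, float], candidate: Dict[str, float],
                gates: Dict[str, Tuple[str, float]] = GATES) -> List[str]:
    """Names of metrics that got worse by more than their tolerance."""
    failed = []
    for metric, (better, tolerance) in gates.items():
        change = candidate[metric] - baseline[metric]
        worsening = -change if better == "higher" else change
        if worsening > tolerance:
            failed.append(metric)
    return failed
```

In a CI job, a non-empty result would translate into a non-zero exit code so the build fails; running the same comparison against the $0/offline configuration keeps that path gated like any other.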

5. Results-tracking table template

One row per release, committed alongside the tag. Cells are placeholders until the suite has run; Δ tokens vs full-ctx and, for HelixCodeMem, the memory-on/off delta are the load-bearing figures.

Date	Helix ver	Suite	precision@k	recall@k	MRR	nDCG	End-task (judge)	Judge↔human	Knowledge-update acc	Abstention (FP-recall)	p50 / p95 (ms)	Tokens/retrieval	Δ tokens vs full-ctx	Poison reach / fire	Gate
2026-06-18	0.1.0	LongMemEval_S	—	—	—	—	—	—	—	—	— / —	—	—	— / —	☐
2026-06-18	0.1.0	HelixCodeMem	—	—	—	—	—	—	—	—	— / —	—	—	— / —	☐
2026-06-18	0.1.0	$0/offline	—	—	—	—	—	—	—	—	— / —	—	—	— / —	☐

6. Opinionated decisions

#	Decision	Rationale
1	LongMemEval is primary; LoCoMo is legacy sanity only	LoCoMo fits in context, lacks knowledge-update Qs, and a full-context baseline beats memory systems on it (ADR-027).
2	Never cite vendor LoCoMo numbers — including ours	The same benchmark yields 58–84% depending on who scores; it's a brawl, not a measurement.
3	Keep the ~1.8K vs 26K token yardstick	Discard LoCoMo's ranking but keep its one durable number: memory must beat full-context on tokens.
4	Define & publish HelixCodeMem	Coding benchmarks are stateless by construction; the cross-task-memory eval simply doesn't exist — so we build it.
5	Score the memory-on/off delta, not absolute solve rate	The value of memory is the carryover; absolute solve rate hides it.
6	Tokens-per-retrieval beating full-context is an acceptance gate	A memory layer that costs more tokens than pasting context is a net negative — fail the build.
7	Forget = zero residual leakage, including derived embeddings	A delete that leaves a live embedding is a privacy and correctness bug; provenance cascade must be verified, not assumed.
8	Abstention/false-positive recall is a first-class metric	Confidently surfacing a wrong memory turns a memory layer into a hallucination amplifier.
9	LLM-as-judge is always backed by a human-audited subset	Judges are noisy; report judge↔human agreement or the score is uninterpretable.
10	The $0/offline config is the default test target	If the free/local path isn't the one CI gates, it will silently rot — and that's the path most users run.
11	Adversarial poisoning (reach + fire) is a CI security gate	Memory is a write-back attack surface; poisoned ingestion that resurfaces later is a vulnerability, not a quality nit.

Sources

Zep, "Lies, Damn Lies, and Statistics: Is Mem0 Really SOTA in Agent Memory?" (LoCoMo teardown; full-context baseline ~73% J; ~16–26K-token convos; data-quality errors) — https://blog.getzep.com/lies-damn-lies-statistics-is-mem0-really-sota-in-agent-memory/
Mem0 paper (LoCoMo ~66.9% LLM-judge, 91% lower p95, ~90% token savings, ~1.8K vs 26K tokens) — https://arxiv.org/abs/2504.19413
Mem0 research hub — https://mem0.ai/research
Zep ↔ Mem0 scoring dispute (84% → 58.44% corrected) — https://github.com/getzep/zep-papers/issues/5
LongMemEval (ICLR 2025): 500 Qs; _S ~115K / _M ~1.5M tokens; extraction, multi-session, temporal, knowledge-updates, abstention — https://arxiv.org/abs/2410.10813
Mem0, "AI Memory Benchmarks in 2026" (MemBench, BEAM; operation coverage) — https://mem0.ai/blog/ai-memory-benchmarks-in-2026
Coding benchmarks treat tasks as independent episodes (no cross-task memory) — https://arxiv.org/html/2602.08316v3 · https://arxiv.org/pdf/2507.00014 · https://arxiv.org/pdf/2512.13564