Temporal Tree Memory: Structured Organization Unlocks Multi-Hop Reasoning in AI Memory Systems

Christian Matsoukis

Research · June 2026

AiVery Systems

Revision note (June 2026). An earlier version of this paper reported an LLM judge score of 0.82 (Claude Sonnet, wide retrieval). A reproduction audit — re-running the original corpus, answering model, judge, retrieval stack, and configuration — could not reproduce that figure; it lands at ~0.73 under the paper's own GPT-4.1-mini judge. We conclude the 0.82 was a measurement artifact of a superseded scoring run, for which no reproducible evidence survives. This version reports the reproducible result and re-leads with the relative comparison to mem0 — the paper's substantive, fully reproducible contribution. Absolute ablation magnitudes that depended on the superseded run are marked (superseded — pending re-measurement). The architecture's qualitative findings and per-category wins over mem0 are unchanged, and in several categories larger than previously reported.

Abstract

We present Temporal Tree Memory, a memory architecture for AI agents in which every stored fact is assigned a position in a tree at write time. The tree topology encodes temporal ordering (root to leaf), fact refinement (parent to child), and contradiction (forking branches), making retrieval a branch traversal problem rather than a flat nearest-neighbor search. A complementary entity heatmap layer surfaces cross-entity memories for multi-hop queries using intersection-count ordering over entity metadata.

On the LOCOMO conversational memory benchmark (1,540 questions across 10 multi-session conversations), our system achieves an LLM judge score of 0.7253 (Claude Sonnet as the answering model, wide retrieval), compared to 0.6383 for mem0 on the same evaluation harness and under the same judge (GPT-4.1-mini): an improvement of +0.087 overall. The gains are concentrated where the architecture is designed to help: temporal reasoning (+0.595 over mem0) and multi-hop reasoning (+0.156), the two categories that most depend on memory structure. Single-hop recall is at parity (+0.007); open-domain remains below mem0 (−0.088). A systematic ablation indicates a clear and robust contribution hierarchy: memory structure and retrieval width account for nearly all of the improvement over flat retrieval, while reranker identity (+0.002) and the choice of reasoning model are marginal. (The absolute magnitudes of the wide-retrieval ablation steps are being re-measured following a scoring-pipeline correction; see §4.2 and the revision note above.)

1. Introduction

An AI agent that can remember is more useful than one that cannot. This is the straightforward observation behind a growing class of memory systems for LLM agents — systems that store facts from conversations, retrieve them at query time, and inject them into the model's context. The architecture of these systems has converged on a common pattern: store memories as flat vector embeddings, retrieve by cosine similarity, and optionally apply post-hoc clustering to add some structure. Systems such as mem0, MemGPT, Zep, and LangMem explore different points in this design space: persistent fact extraction, virtual context management, temporal knowledge graphs, and reflection-based long-term memory tooling [1–4].

The problem with this pattern is that it treats memory as a storage and retrieval problem when it is fundamentally a structure problem. Human memory is not a flat database — it is organized around time, causality, and entity relationships. The fact that "Caroline got a new job" is understood in the context of what her previous job was, who she told first, and what else was happening in her life at that time. Flat retrieval can return the fact; it cannot return the context. And without the context, temporal reasoning ("what was true before X?"), multi-hop reasoning ("how did event A lead to event B?"), and contradiction detection ("what did she believe before she changed her mind?") all fail.

We propose Temporal Tree Memory: a memory architecture in which every fact is placed into a tree at write time, with the tree topology encoding time (root to leaf), refinement (parent to child), and contradiction (forking branches). Retrieval becomes branch traversal: find the right entry point via semantic search, then walk the tree to recover the history and refinements of each retrieved fact. A secondary entity heatmap layer ensures that memories connecting multiple named entities are surfaced for multi-hop queries, using intersection-count ordering rather than cosine similarity to rank cross-entity facts.

On the LOCOMO conversational memory benchmark [5], our system achieves an LLM judge score of 0.7253 (answering with Claude Sonnet-4.6 under wide retrieval), compared to 0.6383 for mem0 on the same harness and under the same judge (GPT-4.1-mini): an improvement of +0.087 overall. The gains are concentrated where the architecture is designed to help: temporal reasoning (+0.595 over mem0) and multi-hop reasoning (+0.156). A systematic ablation indicates a clear contribution hierarchy: memory structure and retrieval width dominate, while reranker identity (+0.002) and reasoning model are marginal. The practical implication is that, for most conversational memory applications, memory architecture and retrieval breadth matter more than the choice of reranker or reasoning model.

The remainder of the paper is organized as follows. Section 2 surveys related work. Section 3 describes the four architectural components. Section 4 presents the evaluation on LOCOMO including the full ablation and comparison to mem0. Section 5 discusses the mechanisms behind the key findings and the open-domain regression. Section 6 describes future directions.

2. Related Work

Memory systems for LLM agents

The dominant pattern in LLM memory systems is flat vector storage with periodic post-hoc structuring. mem0 stores salient information from conversations as a scalable memory layer and uses extraction, consolidation, and retrieval to maintain agent memory [1]. Write-time deduplication runs an LLM classifier to suppress near-duplicate facts. Where mem0-style systems separate the stored memory artifacts from later consolidation/retrieval structure, our temporal tree makes structure the primary fact about a memory: placement in the tree is determined at write time, and the tree topology is the organization, not an approximation of it derived later.

MemGPT [2] frames memory management as a virtual context problem: the LLM context window is treated as RAM and external storage as disk, with explicit instructions for paging memories in and out on overflow. MemGPT's strength is context management under token limits; its memories are flat documents with no structural relationship to one another. It does not address temporal reasoning or multi-hop retrieval as architectural goals.

Zep [3] maintains a temporal knowledge graph over agent memories, using a bi-temporal data model that tracks both when a fact was valid and when it was recorded. Zep's graph explicitly encodes time and relationships, which is philosophically aligned with our approach. The structural difference is that Zep's graph has typed edges between arbitrary nodes (person-WORKS_AT-company), while our tree encodes a single recursive relationship (refinement / update / temporal succession) that is simpler to populate automatically but sufficient to recover temporal context for any retrieved fact. Zep's graph requires named entity recognition and relationship extraction at write time; our tree requires only a cosine nearest-neighbor lookup.

LangMem [4] provides tooling for extracting information from conversations, optimizing agent behavior, and maintaining long-term memory via reflection-based summarization. An LLM reads a window of recent memories and produces summaries that can replace the originals. This reduces memory footprint but compresses away the original facts; questions about specific prior states become unanswerable once the source records are overwritten. Our system preserves all written memories as tree nodes, making the full history of any fact available via ancestor traversal.

Benchmarks for conversational memory

LOCOMO [5] is the benchmark we evaluate on: 10 multi-session conversations, four question categories (single-hop, temporal, multi-hop, open-domain), and a held-out unanswerable set. Questions are answered by a system with access to the stored memories; answers are judged by BLEU, token F1, and an LLM judge. We use LLM judge as our primary metric (see §4.5 for justification). LOCOMO is designed to stress temporal reasoning and multi-hop retrieval, which are the categories most likely to differ across memory architectures — making it an appropriate benchmark for our structural claims.

BEAM [6] is a more recent benchmark targeting long-horizon multi-session memory at larger scale (100K–1M token corpora). BEAM is harder than LOCOMO primarily through its scale: the relevant memory may be hundreds of turns from the query, and categories like event_ordering and multi_session_reasoning require finding and correctly ordering facts that are far apart in time. We evaluated a pilot run of our system on BEAM and treat full BEAM evaluation as future work (§6).

Reranking in retrieval-augmented generation

Two-stage retrieval, where a first-stage retriever produces candidates and a more expressive second-stage model reranks them, is standard in dense retrieval and RAG-style systems [7,8]. Cross-encoders read the query and each candidate jointly, allowing them to bridge the query-statement semantic gap that bi-encoders approximate. In the RAG literature, MS MARCO-trained cross-encoders [9] and Cohere's hosted rerank API [10] are the most commonly used. Prior work applies reranking to flat retrieval pools; we apply it downstream of tree expansion, so the cross-encoder receives a candidate pool that already contains structured neighborhood context rather than cosine-only results. As our ablation shows (§4.2), the identity of the reranker contributes +0.002 in isolation; its value is amplified by the quality of the candidates it receives.

3. Architecture

The core design principle of Temporal Tree Memory is that structure should be primary, not post-hoc. Existing memory systems store facts as flat records and apply clustering periodically as a retrieval optimization. We invert this: every memory is placed into a tree at write time, and the tree topology is the clustering. Related memories are parent and child, time flows root-to-leaf, and contradictions fork.

This section describes the four components that realize this principle: the temporal tree (§3.1), the entity heatmap (§3.2), the async validation pipeline that populates the tree without blocking writes (§3.3), and the reranking layer that operates downstream of retrieval (§3.4).

3.1 Temporal Tree Structure

The foundational change is a single schema addition: a parent_id self-referential foreign key on the memories table. A memory with parent_id = null is a topic root — an entry point into a subject domain. A memory with a non-null parent_id is a child: a refinement, update, elaboration, or contradiction of its parent. The tree is built incrementally as memories are written, so at any point in time the topology reflects the full history of what an agent has learned about each topic.

Tree traversal. Four recursive CTE queries are the retrieval primitives:

GetAncestorsAsync(id, maxDepth=10) — walks up toward the root; returns the history leading to any fact

GetDescendantsAsync(rootId, maxDepth=5) — walks down; returns all refinements of a concept

GetTreeContextAsync(seedIds, maxDepth=2) — breadth-first expansion from multiple entry points simultaneously; used in retrieval to expand top-K semantic results into their surrounding subtrees

GetTemporalNeighborsAsync(id, window=2) — returns the 2 memories before and 2 after a given node by created_at within the same agent; used only when IsTemporalQuery detects a time-oriented question
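The ancestor and descendant walks can be illustrated with a small, self-contained sketch. It uses Python's built-in sqlite3 module rather than the production database, and the table layout, sample rows, and function names (get_ancestors, get_descendants) are assumptions made for illustration; only the parent_id self-reference and the default depth limits come from the description above.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with a self-referential parent_id (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE memories ("
        " id INTEGER PRIMARY KEY,"
        " parent_id INTEGER REFERENCES memories(id),"
        " content TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO memories (id, parent_id, content) VALUES (?, ?, ?)",
        [
            (1, None, "Caroline works in retail"),            # topic root
            (2, 1, "Caroline is looking for a new job"),      # child of 1
            (3, 2, "Caroline got a job at a tech startup"),   # child of 2
            (4, 1, "Caroline dislikes her commute"),          # second child of 1
        ],
    )
    return conn


def get_ancestors(conn: sqlite3.Connection, memory_id: int, max_depth: int = 10) -> list:
    """Walk up toward the root; returns ancestor ids, nearest first."""
    rows = conn.execute(
        """
        WITH RECURSIVE chain(id, parent_id, depth) AS (
            SELECT id, parent_id, 0 FROM memories WHERE id = ?
            UNION ALL
            SELECT m.id, m.parent_id, c.depth + 1
            FROM memories m JOIN chain c ON m.id = c.parent_id
            WHERE c.depth < ?
        )
        SELECT id FROM chain WHERE depth > 0 ORDER BY depth
        """,
        (memory_id, max_depth),
    ).fetchall()
    return [r[0] for r in rows]


def get_descendants(conn: sqlite3.Connection, root_id: int, max_depth: int = 5) -> list:
    """Walk down from a node; returns descendant ids ordered by depth, then id."""
    rows = conn.execute(
        """
        WITH RECURSIVE chain(id, depth) AS (
            SELECT id, 0 FROM memories WHERE id = ?
            UNION ALL
            SELECT m.id, c.depth + 1
            FROM memories m JOIN chain c ON m.parent_id = c.id
            WHERE c.depth < ?
        )
        SELECT id FROM chain WHERE depth > 0 ORDER BY depth, id
        """,
        (root_id, max_depth),
    ).fetchall()
    return [r[0] for r in rows]
```

In this toy tree, the ancestors of memory 3 are its parent (2) and the topic root (1), which is the history leading to the fact; the descendants of the root are every refinement recorded beneath it.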

Context expansion. The /memory/context endpoint takes the top semantic results and expands each into a subtree: ancestors up to depth 2, children to depth 1. This produces a context that is not a list of independent facts but a structured neighborhood — the topic lineage above a memory and its refinements below. The expansion is always active; temporal sequential expansion (conversation-order neighbors) is gated on IsTemporalQuery to prevent multi-hop noise.

Parent assignment. When a new memory is validated, the validator derives a SuggestedParentId from the conflicts it detects and the nearest-neighbor similarity of the new memory to existing ones. The priority chain is: refinement relationship > update/contradiction relationship > cosine similarity above threshold > null (new root). Parent assignment uses a two-threshold design: a tight provisional threshold (0.85) applied at write time, and a relaxed threshold (0.60) used by the background validator when a semantic relationship is detected. The write-time threshold is deliberately tight to prevent early or general memories from accumulating many children and becoming hubs; the validator's lower threshold allows related-but-distinct memories (cosine 0.62–0.68) that would otherwise become orphan roots to be correctly placed as tree siblings once the semantic relationship is confirmed. A per-node child cap (12 children) prevents any single node from becoming a hub regardless of similarity scores. This means the tree topology encodes semantic relationships explicitly: a memory that refines an existing fact becomes its child; a memory that contradicts it forks from the same parent; a new topic starts a new root.
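The priority chain can be sketched as a pure function. This is one reading of the description, not the system's actual validator: the Neighbor type and suggest_parent_id name are hypothetical, the sketch assumes the relaxed 0.60 threshold applies only when a relationship label is present and the tight 0.85 threshold applies to the cosine-only fallback, and it assumes an update becomes a child of the updated memory while a contradiction forks from the contradicted memory's parent.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

WRITE_TIME_THRESHOLD = 0.85   # tight provisional threshold
VALIDATOR_THRESHOLD = 0.60    # relaxed threshold once a relationship is detected
MAX_CHILDREN = 12             # per-node child cap


@dataclass(frozen=True)
class Neighbor:
    """A nearby existing memory (hypothetical shape, for illustration only)."""
    memory_id: int
    parent_id: Optional[int]
    cosine: float
    relation: str = "no_relation"  # label assigned by the validator, if any
    child_count: int = 0


def suggest_parent_id(neighbors: list[Neighbor]) -> Optional[int]:
    """Apply: refinement > update/contradiction > cosine above threshold > new root."""
    ranked = sorted(neighbors, key=lambda n: n.cosine, reverse=True)

    # 1. Refinement: the new memory becomes a child of the memory it refines.
    for n in ranked:
        if (n.relation == "refinement" and n.cosine >= VALIDATOR_THRESHOLD
                and n.child_count < MAX_CHILDREN):
            return n.memory_id

    # 2. Update or contradiction.
    for n in ranked:
        if n.cosine < VALIDATOR_THRESHOLD:
            continue
        if n.relation == "update" and n.child_count < MAX_CHILDREN:
            return n.memory_id
        if n.relation == "contradiction":
            # Fork beside the contradicted memory (None if it is itself a root).
            # The cap on that shared parent is not checked in this sketch.
            return n.parent_id

    # 3. No relationship detected: fall back to the tight cosine threshold.
    for n in ranked:
        if n.cosine >= WRITE_TIME_THRESHOLD and n.child_count < MAX_CHILDREN:
            return n.memory_id

    # 4. Nothing qualifies: the memory starts a new topic root.
    return None
```

Under these assumptions, a neighbor at cosine 0.65 attracts the new memory only if the validator has labeled a relationship; without a label the same neighbor falls below the write-time threshold and the memory becomes a root.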

Tree expansion in retrieval. The RetrieveAsync path includes a tree expansion pass after the initial vector+lexical candidate pool is assembled. The top-5 scoring candidates are used as entry points; their subtrees (up to depth 2) are traversed via GetTreeMemberIdsAsync, and any uncovered descendant IDs are added to the candidate pool with a discounted score (average pool score × 0.75). This surfaces closely related memories that the embedding search missed — refinements and updates that are semantically nearby but not nearest-neighbor matches — while preventing expansion candidates from displacing strong direct matches.
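A minimal sketch of the expansion pass follows. The function name expand_pool and the in-memory children map are stand-ins for the database-backed traversal; the seed count, depth limit, and 0.75 discount are the values stated above.

```python
def expand_pool(pool, children, top_n=5, max_depth=2, discount=0.75):
    """Add uncovered descendants of the top-scoring candidates to the pool.

    pool:     dict mapping memory id -> retrieval score
    children: dict mapping memory id -> list of child ids (stand-in for the tree)
    """
    if not pool:
        return {}

    # Every expansion candidate receives the same discounted score.
    expansion_score = discount * sum(pool.values()) / len(pool)

    # The top-N scoring candidates are the entry points into the tree.
    frontier = sorted(pool, key=lambda mid: pool[mid], reverse=True)[:top_n]

    expanded = dict(pool)
    for _ in range(max_depth):
        next_frontier = []
        for node in frontier:
            for child in children.get(node, []):
                next_frontier.append(child)
                if child not in expanded:  # direct matches keep their own score
                    expanded[child] = expansion_score
        frontier = next_frontier
    return expanded
```

Because the expansion score is tied to the pool average, descendants can rank above weak direct matches but never above candidates that scored well on their own.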

Recency-aware retrieval. Queries asking about the current state of a fact ("What is [person]'s job now?", "Where does [person] live?") require preferring the most recent version of a memory over the highest-cosine match. A regex classifier detects current-state query intent and, when triggered, applies a recency weight override (0.35, versus the base 0.03) in the scoring function. The 0.35 value was selected empirically: inspection of retrieval failure cases showed that a weight below ~0.20 was insufficient to overcome cosine dominance for near-duplicate memories, while weights above ~0.40 caused recency to overwhelm semantic relevance entirely. We do not report classifier accuracy on a held-out set; this is a known limitation and future work should validate the regex against an annotated query sample. The override shifts ranking toward recently created memories without changing the candidate pool, addressing the failure mode where an outdated version of a fact ranks above its successor because its embedding is marginally closer to the query.
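The effect of the override can be shown with a toy scoring function. The paper does not give the form of the scoring function or the classifier's patterns, so both are assumptions here: the sketch uses a simple linear blend of cosine similarity and a recency value in [0, 1], and an illustrative regex that is not the production classifier. Only the two weights (0.03 and 0.35) come from the text.

```python
import re

BASE_RECENCY_WEIGHT = 0.03
CURRENT_STATE_RECENCY_WEIGHT = 0.35

# Illustrative patterns only; the real classifier's regex is not specified.
_CURRENT_STATE = re.compile(
    r"\b(now|currently|these days|at the moment)\b|^\s*where (does|do)\b",
    re.IGNORECASE,
)


def is_current_state_query(query: str) -> bool:
    """Return True when the query appears to ask about the present state of a fact."""
    return _CURRENT_STATE.search(query) is not None


def score(cosine: float, recency: float, query: str) -> float:
    """Assumed linear blend; recency is 1.0 for the newest memory, 0.0 for the oldest."""
    if is_current_state_query(query):
        weight = CURRENT_STATE_RECENCY_WEIGHT
    else:
        weight = BASE_RECENCY_WEIGHT
    return (1.0 - weight) * cosine + weight * recency
```

With an outdated memory at cosine 0.85 and recency 0.1, and its successor at cosine 0.80 and recency 0.9, the outdated memory wins under the base weight (about 0.83 versus 0.80) but loses once a current-state query triggers the override (about 0.59 versus 0.84).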

3.2 Entity Heatmap

The entity heatmap is a second retrieval layer that runs in parallel with cosine-based semantic retrieval. Its purpose is to ensure that memories about named entities mentioned in the query are represented in the context, even if their cosine similarity to the query is below the top-K threshold.

Mechanism. The system maintains a per-organization cache of known entity names (5-minute TTL). When a context request arrives, the query is matched against this cache using word-boundary string matching to extract mentioned entities. Two pools of entity memories are then fetched via GetByNamedEntitiesAsync:

Hot memories (cap 25): memories in parent groups already activated by the semantic retrieval results. These are entities that the cosine search already found relevant; the heatmap adds depth within those branches.

Cold memories (cap 5): memories tagged with queried entities but not yet in any activated branch. These provide entity coverage that cosine similarity missed.
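The entity matching and the hot/cold split can be sketched as follows. The function names and the (memory_id, parent_id) tuple shape are hypothetical, and case-insensitive matching is an assumption; the word-boundary matching and the two caps follow the description above.

```python
import re

HOT_CAP = 25
COLD_CAP = 5


def extract_entities(query, known_entities):
    """Return the known entity names that appear in the query as whole words."""
    found = []
    for name in known_entities:
        pattern = rf"\b{re.escape(name)}\b"
        if re.search(pattern, query, flags=re.IGNORECASE):
            found.append(name)
    return found


def split_hot_cold(entity_memories, activated_parents):
    """Split entity-tagged memories by whether their parent group is already active.

    entity_memories:   list of (memory_id, parent_id) tuples, already ranked
    activated_parents: set of parent ids surfaced by semantic retrieval
    """
    hot = [mid for mid, parent in entity_memories if parent in activated_parents]
    cold = [mid for mid, parent in entity_memories if parent not in activated_parents]
    return hot[:HOT_CAP], cold[:COLD_CAP]
```

Word-boundary matching is what keeps a cached entity such as "Carol" from firing on a query that mentions "Caroline".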

Intersection ordering. GetByNamedEntitiesAsync sorts results by intersection count: memories tagged with multiple queried entities rank first. This is the load-bearing mechanism for multi-hop reasoning: a question that asks about the relationship between two entities (e.g., "Where did [person] work when [event] happened?") will surface memories that connect both entities near the top of the entity pool, before memories that relate to only one.

This ordering is what Cohere reranking cannot replicate on its own: the intersection-count sort operates on structured entity metadata, not on semantic similarity. As our ablation shows (§4.2), disabling this in favor of pure vector search causes multi-hop to regress by 0.02 even when the total candidate count is unchanged.
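Intersection-count ordering is simple enough to state in a few lines. In this sketch the function name and the in-memory tag map are illustrative, and breaking ties by memory id is an assumption made only to keep the output deterministic.

```python
def rank_by_intersection(memory_entities, queried):
    """Order memories by how many queried entities they are tagged with.

    memory_entities: dict mapping memory id -> set of entity tags
    queried:         set of entity names extracted from the query
    """
    counts = {mid: len(tags & queried) for mid, tags in memory_entities.items()}
    matching = [mid for mid, count in counts.items() if count > 0]
    # Highest intersection count first; ties broken by id (assumption).
    return sorted(matching, key=lambda mid: (-counts[mid], mid))
```

A memory tagged with both queried entities sorts ahead of memories tagged with only one, regardless of how any of them score on embedding similarity.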

Relationship to lexical retrieval. The entity heatmap is superficially similar to BM25 — both surface memories relevant to query keywords — but they differ in three ways that matter for multi-hop reasoning. First, BM25 scores by lexical overlap between query tokens and memory text; the heatmap operates on structured entity metadata extracted at write time by the enrichment pipeline. A memory recorded as "she secured a position at Meridian" scores zero in BM25 for the query "Caroline's job" because neither token appears in the text, but the heatmap finds it because the enrichment pipeline tagged it with the entity Caroline. Second, BM25 has no privileged notion of multi-entity intersection — it scores by combined term frequency across all query tokens. The heatmap's intersection-count ordering explicitly ranks memories tagged with multiple queried entities above memories tagged with only one, which is the structural signal that enables cross-entity retrieval. Third, the hot/cold pool split distinguishes memories already activated by semantic search (where the heatmap adds depth) from those missed by cosine similarity entirely (where it adds coverage). BM25 treats all candidates identically regardless of what the vector stage found.

3.3 Async Validation and Maintenance Pipeline

The validation pipeline is responsible for two things: assigning parent IDs (tree placement) and rejecting duplicates. In earlier versions, this ran synchronously on the write path, adding one LLM call per memory. The async pipeline moves this off the hot path entirely.

Write flow. When a memory is written, the system computes the embedding and stores the record with validation_status = pending_validation, then immediately returns. Two jobs are enqueued on a Postgres-backed job queue: a high-priority validation job and a medium-priority enrichment job.

Background worker. BackgroundJobWorkerService polls the background_jobs table every 2 seconds using FOR UPDATE SKIP LOCKED for safe concurrent dequeue. The validation handler calls the LLM classifier, which assigns one of six relationship labels (duplicate | contradiction | update | refinement | uncertain | no_relation) against the nearest existing memories. If the result is duplicate, the pending memory is marked stale. Otherwise, parent_id is set according to the priority chain described in §3.1. The enrichment handler extracts named entities and atomic facts and writes them to memory_entities and memory_facts.

Production benefit. Writes return in under 200ms. LLM validation, which can take 1–3 seconds, runs in the background. The tree structure converges within seconds of a write completing.

Deterministic graph maintenance. A separate background pass — run on demand after a period of agent inactivity — applies deterministic graph hygiene: marking stale memories, merging duplicates, cleaning orphaned edges, and refreshing cluster labels. This pass uses no LLM calls and is idempotent; it can be re-run after any batch ingestion to consolidate the tree structure.

3.4 Reranking Layer

Cosine similarity between a query embedding and a memory embedding has a fundamental limitation for conversational memory retrieval: questions are phrased as queries ("What job did she get?") and memories are stored as statements ("Caroline secured employment at a tech startup in the fall"). A bi-encoder assigns these similar but not identical representations, and at K=50 with more than a thousand memories per agent, the top-K window covers only ~3–5% of the corpus.

A cross-encoder reads the query and each candidate memory jointly, allowing it to bridge this query-statement gap directly. We implement this as a pluggable IRerankingService with a Cohere rerank-v3.5 backend (and a no-op fallback for self-hosted deployments).

Integration. Reranking operates on the full candidate pool assembled by the semantic retrieval and entity heatmap — after tree expansion, before context is returned to the caller. This means the reranker sees the richer, tree-expanded candidate set rather than the raw cosine-only results. Cohere receives the query and a list of candidate memory strings and returns a relevance-ranked ordering; the API returns the top limit memories from that ordering.

Wide retrieval. To give the reranker more candidates to choose from, the API internally retrieves limit × RetrieveMultiplier candidates from cosine + entity heatmap before reranking to limit. With RetrieveMultiplier = 4 and limit = 50, the cosine stage retrieves 200 candidates and Cohere selects the top 50. This decouples the initial retrieval width from the final context size.
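The decoupling of retrieval width from context size can be sketched with a pluggable reranker interface. The Protocol below is a Python analogue of the IRerankingService idea, not its actual signature; KeywordOverlapReranker is a toy stand-in for a cross-encoder and does not call any hosted API, and get_context and retrieve are hypothetical names.

```python
from typing import Callable, List, Protocol, Sequence


class Reranker(Protocol):
    def rerank(self, query: str, candidates: Sequence[str], top_n: int) -> List[str]:
        ...


class NoOpReranker:
    """Fallback: keep the first-stage order and truncate."""

    def rerank(self, query: str, candidates: Sequence[str], top_n: int) -> List[str]:
        return list(candidates[:top_n])


class KeywordOverlapReranker:
    """Toy stand-in for a cross-encoder: ranks by shared lowercase tokens."""

    def rerank(self, query: str, candidates: Sequence[str], top_n: int) -> List[str]:
        query_tokens = set(query.lower().split())
        ranked = sorted(
            candidates,
            key=lambda text: len(query_tokens & set(text.lower().split())),
            reverse=True,  # sorted() is stable, so ties keep first-stage order
        )
        return ranked[:top_n]


def get_context(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    reranker: Reranker,
    limit: int = 50,
    retrieve_multiplier: int = 4,
) -> List[str]:
    """Retrieve limit * multiplier candidates, then rerank down to limit."""
    candidates = retrieve(query, limit * retrieve_multiplier)
    return reranker.rerank(query, candidates, limit)
```

With the defaults, the first stage is asked for 200 candidates and the caller still receives 50; a relevant memory ranked 150th by the first stage is reachable by the reranker, whereas at a width of 50 it would never be seen.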

4. Evaluation

4.1 Benchmark Setup

We evaluate on LOCOMO [5], a conversational memory benchmark consisting of 10 multi-session conversations between two speakers. Each conversation spans multiple sessions over simulated time and is accompanied by a question set covering four categories: single-hop (direct factual recall, 282 questions), temporal (time-relative reasoning, 321 questions), multi-hop (connecting facts across two entities or events, 96 questions), and open-domain (broad knowledge about the participants, 841 questions). Category 5 (unanswerable questions, 446 items) is excluded from all runs, yielding 1,540 evaluated questions.

Ingestion. Each conversation is ingested as a single agent (conversation_0 through conversation_9), treating both speakers as a shared memory space. This matches the production model where a single agent accumulates memories from both sides of a conversation. Memories are extracted using GPT-4.1-mini and stored via the async write pipeline described in §3.3. Full LLM validation and deduplication produce approximately 1,200–1,400 memories per conversation after the background jobs complete.

Retrieval and answering. At evaluation time, each question is answered by passing it to the Cortex agent, which calls /memory/context with limit=50 and RetrieveMultiplier=4 (200 candidates retrieved, Cohere reranks to 50). The agent answers using the returned context. A 0.5-second delay between questions prevents rate limiting.

Evaluation. Answers are evaluated using three metrics: BLEU, token F1, and an LLM judge score. The LLM judge (GPT-4.1-mini) receives the question, the gold answer, and the system's answer and scores correctness on a 0–1 scale. We report LLM judge score as the primary metric (see §4.5).
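For readers unfamiliar with token F1, a common formulation is sketched below. The exact normalization used by the evaluation harness is not described here, so the lowercase whitespace tokenization is an assumption, and token_f1 is an illustrative name rather than the harness's function.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between two strings."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)

    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Because precision is computed over the predicted tokens, a correct but verbose answer scores lower than a terse one, which is why token F1 is sensitive to answer format in a way the LLM judge is not.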

Failure classification. For incorrect answers, we run a secondary LLM classifier that determines the failure mode: retrieval_failure (the correct answer was not in the top-K retrieved memories), reasoning_failure (the correct answer was present in context but the model failed to use it), or extraction_failure (the fact appears in the source conversation but was never stored correctly during ingestion).

4.2 Ablation Grid

Scoring correction. The wide-retrieval and full-feature-stack rows of this grid were originally scored at ~0.80–0.82. A reproduction audit could re-derive only the wide-K + Claude Sonnet configuration, which reproduces at 0.7253 under the GPT-4.1-mini judge (not 0.8227). The other wide-K and full-stack absolute scores are marked (superseded — pending re-measurement) — they are not reproducible from surviving artifacts and should not be cited. The K=50 rows are shown at their original values (not re-verified). The contribution hierarchy the ablation establishes — structure and retrieval width dominate; reranker and reasoning model are marginal — is robust and does not depend on the superseded absolute magnitudes.

Configuration	llm_score	Delta / status
Flat retrieval, ms-marco reranker	0.5552	— (original run; not re-verified)
Flat retrieval, Cohere wide-K (200→50), gpt-4.1-mini (reproduced)	0.5571	+0.002 vs ms-marco; architecture contributes ~+0.18, reranker ~+0.18
Tree + entity heatmap, ms-marco	0.6669	+0.112 vs flat ms-marco (not re-verified)
Tree + entity heatmap, Cohere K=50, gpt-4.1-mini	0.6773	+0.122 vs flat Cohere (not re-verified)
Tree + entity heatmap, Cohere wide-K (200→50), gpt-4.1-mini	~~0.8000~~	(superseded — pending re-measurement)
Tree + entity heatmap, Cohere K=50, Claude Sonnet	0.6851	+0.008 vs gpt-4.1-mini at same width (not re-verified)
Tree + entity heatmap, Cohere wide-K, Claude Sonnet (reproduced)	0.7253	reproducible headline; +0.087 vs mem0 (0.6383), GPT-4.1-mini judge
Tree + entity heatmap, Cohere wide-K, Claude Sonnet, concise (reproduced)	0.7578	F1 0.4605 > mem0 F1 0.4057; beats mem0 on all three metrics
(rows below run on CMC-pruned experiment corpus — not directly comparable to rows above)
++ tree expansion in retrieval + recency mode, wide-K, Claude Sonnet	~~0.8148~~	(superseded — pending re-measurement)
++ adaptive K + entity hyperedge, wide-K, gpt-4.1-mini	~~0.8052~~	(superseded — pending re-measurement)
++ full feature stack (adaptive K + entity + recency + tree), Claude Sonnet	~~0.8201~~	(superseded — pending re-measurement)

Finding 1: Reranker choice (Cohere vs ms-marco) on flat retrieval: +0.002 — noise. The reranker does not drive performance; the candidate pool does.

Finding 2: Tree + entity heatmap adds +0.12 on top of both rerankers. Structural organization is the primary architectural contribution.

Finding 3: Wide retrieval (K=200 candidates → Cohere selects 50) is a large lever: retrieval coverage, not reranking, was the primary remaining bottleneck, and Cohere with 200 candidates is substantially more effective than Cohere with 50. (The originally reported +0.123 magnitude for this step depended on the superseded wide-K score; the reproduced wide-K + Sonnet configuration scores 0.7253. The direction is not in doubt, but the exact magnitude under corrected scoring is pending re-measurement.)

Finding 4: The reasoning-model upgrade (gpt-4.1-mini → Claude Sonnet) is marginal at equal retrieval width — the LLM contribution is near noise-level. Memory retrieval quality dominates answer quality; practitioners do not need a premium reasoning model to achieve strong results — they need better retrieval coverage. (The specific deltas originally reported, +0.008 at K=50 and +0.023 at wide-K, are not re-verified or depend on a superseded run; the qualitative conclusion stands.)

Finding 5: Multi-hop — flat + Cohere: ~0.29; tree + heatmap + Cohere: ~0.70 (reproduced multi-hop 0.6979, vs mem0's flat retrieval at 0.5417). Intersection-count ordering in the entity heatmap, not the reranker, enables cross-entity multi-hop reasoning.

Finding 6 (superseded — pending re-measurement): This configuration (tree expansion in retrieval + recency-aware scoring) was measured only in the superseded scoring run; its reported overall delta (−0.008) and failure-count shifts are not reproducible from surviving artifacts. The qualitative motivation — these features improve context quality (the right version of a fact reaches the model) rather than closing raw retrieval gaps — is retained as a hypothesis to be re-tested.

Finding 7 (superseded — pending re-measurement): The adaptive-K + entity-hyperedge configuration was measured only in the superseded run, where it appeared to lift open-domain to parity with mem0 (a reported 0.8216 vs 0.8109). That result does not reproduce: in the corrected, reproducible configuration open-domain is 0.7229, still below mem0's 0.8109 (−0.088). Adaptive retrieval width remains a well-motivated direction for broad open-domain aggregation queries, but the claim that it reaches or exceeds mem0 on open-domain is withdrawn pending re-measurement.

Finding 8 (superseded — pending re-measurement): The full feature stack with Claude Sonnet was reported at 0.8201 overall with multi-hop 0.7917. Neither figure reproduces from surviving artifacts and both are withdrawn. The reproducible reference point is the wide-K + Claude Sonnet configuration at 0.7253 overall, with multi-hop 0.6979 (+0.156 over mem0). Whether the additional full-stack features (adaptive K, entity-hyperedge expansion, recency-aware scoring, tree expansion in retrieval) improve on that baseline is an open question for re-measurement.

The ablation isolates four independent variables: memory structure (flat vs tree+heatmap), reranker identity (ms-marco vs Cohere), retrieval width (K=50 vs K=200→50), and reasoning model (gpt-4.1-mini vs Claude Sonnet). It reveals a clear contribution hierarchy: structure first, retrieval width second, reranker identity near-zero, reasoning model marginal. This hierarchy is robust to the scoring correction; only the absolute magnitudes of the wide-retrieval steps are affected.

Note on architectural consistency. The rows above the corpus divider (through "Tree + entity heatmap, Cohere wide-K, Claude Sonnet" and its concise variant) use the /memory/context endpoint's tree expansion. The final three rows additionally enable tree expansion within RetrieveAsync, entity-hyperedge expansion, recency-aware scoring, and adaptive K in the retrieval pipeline. These are additive features, not replacements; the early rows are not retroactively affected. Rows sharing the same retrieval pipeline are directly comparable; the last three rows form a separate configuration family measured against the wide-K baselines.

Wide retrieval is the largest single step in the table and reflects the core coverage problem in dense memory retrieval: at K=50 with ~1,300 memories per agent, the initial cosine pass covers only 3–4% of the corpus. Expanding to K=200 before reranking quadruples the search window and allows Cohere to select the best 50 from a far richer candidate set. This is retrieval coverage as architecture, not engineering. (The originally reported +0.123 magnitude for this step is pending re-measurement; see the scoring correction at the top of this section.)
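The coverage figures follow directly from the candidate count and the corpus size. The short calculation below uses the approximate per-agent corpus sizes quoted in this paper; coverage is an illustrative helper name.

```python
def coverage(k: int, corpus_size: int) -> float:
    """Fraction of the corpus seen by a first-stage retrieval of k candidates."""
    return min(k, corpus_size) / corpus_size


for corpus_size in (1200, 1300, 1400):
    narrow = coverage(50, corpus_size)
    wide = coverage(200, corpus_size)
    print(f"{corpus_size} memories: K=50 covers {narrow:.1%}, K=200 covers {wide:.1%}")
```

At roughly 1,300 memories, K=50 sees a little under 4% of the corpus and K=200 a little over 15%, so the reranker chooses from four times as many candidates.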

4.3 Comparison vs mem0

Category	mem0	Ours	Delta
single-hop	0.7270	0.7340	+0.007
temporal	0.1371	0.7321	+0.595
multi-hop	0.5417	0.6979	+0.156
open-domain	0.8109	0.7229	−0.088
overall	0.6383	0.7253	+0.087

Both columns are scored by the same LLM judge (GPT-4.1-mini); our system answers with Claude Sonnet under wide retrieval (K=200 candidates → Cohere reranks to 50). We compare against mem0 [1] using an identical evaluation harness: both systems ingest the same 10 LOCOMO conversations and are queried with the same questions. mem0's LLM judge score of 0.6383 is measured on our harness (results/mem0_eval.json), not taken from the published paper, ensuring the comparison is on identical infrastructure. The overall improvement of +0.087 is driven by temporal (+0.595) and multi-hop (+0.156) — exactly the two categories on which a structured memory should win, and where the temporal tree and entity heatmap provide advantages that mem0's flat retrieval cannot replicate.
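As a consistency check, the overall scores can be recomputed from the per-category scores and the question counts given in §4.1. The sketch assumes the overall figure is the question-weighted mean of the category scores; the table values are consistent with that reading.

```python
COUNTS = {"single-hop": 282, "temporal": 321, "multi-hop": 96, "open-domain": 841}
MEM0 = {"single-hop": 0.7270, "temporal": 0.1371, "multi-hop": 0.5417, "open-domain": 0.8109}
OURS = {"single-hop": 0.7340, "temporal": 0.7321, "multi-hop": 0.6979, "open-domain": 0.7229}


def overall(scores: dict) -> float:
    """Question-weighted mean of per-category LLM judge scores."""
    total = sum(COUNTS.values())
    return sum(scores[cat] * COUNTS[cat] for cat in COUNTS) / total


print(sum(COUNTS.values()))                      # number of evaluated questions
print(round(overall(MEM0), 4))                   # mem0 overall
print(round(overall(OURS), 4))                   # our overall
print(round(overall(OURS) - overall(MEM0), 3))   # overall delta
```

Because open-domain accounts for more than half of the questions, the −0.088 deficit there offsets much of the temporal gain in the overall figure, which is why the per-category rows are more informative than the overall delta.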

Open-domain is the one category where mem0 leads (−0.088). Open-domain questions in LOCOMO tend to be broad aggregation queries ("What are [person]'s hobbies?") that reward high recall over precision, and our precision-oriented retrieval pipeline surfaces fewer of the relevant facts. We previously reported that adaptive retrieval width closes this gap and reaches parity with mem0; that result was measured only in the superseded scoring run and does not reproduce (see §4.2, Finding 7). Closing the open-domain gap remains future work.

Single-hop is effectively at parity (+0.007): on simple direct-recall questions, retrieval structure adds little because the target memory is almost always in the top-K cosine results regardless of expansion strategy. Both systems land near 0.73 on this category.

Metric provenance note. The 0.6383 figure is mem0's LLM judge score on our evaluation harness, not the F1 score reported in the mem0 paper. The two metrics are not interchangeable (see §4.5). All comparisons in this paper use LLM judge scores measured on the same harness.

All-metric comparison (concise format). When our system answers in five words or fewer — matching the extractive format mem0 uses — it achieves LLM 0.7578, F1 0.4605, BLEU 0.3843, compared to mem0's LLM 0.6383, F1 0.4057, BLEU 0.3599. This is the only configuration in which a direct apples-to-apples F1 comparison is valid (see §4.5). Under matched answer format, our system leads mem0 on all three metrics.

4.4 Failure Analysis

We run the failure classifier on all incorrect answers. (The specific counts below were produced on the superseded full-feature-stack run and are pending re-measurement on the corrected, reproducible configuration. We retain them as the qualitative failure profile — which we expect to hold — but the exact counts and percentages should not be cited.)

Failure mode	Count	Percentage	Status
Retrieval failure	174	~63%	(superseded run)
Reasoning failure	99	~36%	(superseded run)
Extraction failure	3	~1%	(superseded run)

The qualitative picture is what matters and is consistent across configurations: retrieval failure is the dominant mode (the correct memory is not among the top-K candidates), while extraction failure is negligible (~1%). As context quality improves, retrieval is left as the main bottleneck, and closing the gap requires a larger candidate pool or a better initial embedding, not better context handling. Re-measuring the exact distribution on the corrected configuration is pending.

4.5 Metric Justification

We report LLM judge score as the primary metric rather than BLEU or token F1. This choice requires justification because F1 is the standard in QA benchmarks and the mem0 paper reports F1 as its primary metric. BLEU is a reference-overlap metric originally developed for machine translation [11]; LLM-as-a-judge methods are increasingly used for open-ended generation evaluation, albeit with known bias and reliability concerns [12,13].

Token F1 measures lexical overlap between a system answer and a gold reference answer. For systems that return short, extractive answers ("tech startup in Austin"), F1 is a reasonable proxy for correctness. For conversational memory systems like ours, where the agent generates a full natural-language answer ("Caroline works at a tech startup she joined in the fall of that year"), F1 will penalize the system even when the answer is semantically correct. This is a measurement artifact, not a capability difference.
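To make the penalty concrete, the following is a minimal sketch of token-level F1, assuming lowercase whitespace tokenization. The function name `token_f1` and the normalization are illustrative assumptions, not the evaluation harness's code; the harness may normalize answers differently. It scores the two example answers above against the short gold reference.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a gold reference.

    Assumption: tokens are lowercase whitespace-separated words.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


gold = "tech startup in Austin"
concise = "tech startup in Austin"
verbose = "Caroline works at a tech startup she joined in the fall of that year"

print(f"concise: {token_f1(concise, gold):.2f}")  # concise: 1.00
print(f"verbose: {token_f1(verbose, gold):.2f}")  # verbose: 0.33
```

The verbose answer shares three of the four gold tokens, but its fourteen-token length drags precision down to 3/14, so F1 falls to one third even though most of the reference content is present.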

We demonstrate this concretely: when we force our system to answer in five words or fewer (matching the format of extractive systems), token F1 rises from 0.1356 to 0.4605 — an improvement of +0.325 — while the LLM judge score rises marginally from 0.7357 to 0.7578. The format change makes our system look substantially better on F1 without hurting answer quality, confirming that the original F1 gap was entirely a formatting artifact: the answers were semantically correct but verbose, and token overlap penalized verbosity rather than incorrectness. Under matched concise format, our system leads mem0 on all three metrics: LLM 0.7578 vs 0.6383 (+0.119), F1 0.4605 vs 0.4057 (+0.055), BLEU 0.3843 vs 0.3599 (+0.024).

The LLM judge, which evaluates whether the answer is factually correct and responsive to the question regardless of phrasing, is the appropriate primary metric for conversational memory systems. We report BLEU and F1 for completeness and to allow comparison with papers that use those metrics.

5. Discussion

Why the tree helps temporal reasoning

Temporal questions in LOCOMO follow a consistent pattern: they ask about a state that was true before or after some reference event ("What was her job before she moved?", "Did they stay in touch after that?"). Answering these questions correctly requires knowing not just the current state of a fact but its history — the sequence of updates that led to the current state.

In a flat memory system, each version of a fact is a separate record with no structural relationship to its predecessors. Retrieving the current state is easy; retrieving the prior state requires inferring from timestamps alone, which is brittle when memories are densely sampled or when the timeline is ambiguous.

In our temporal tree, the ancestor chain of any memory is exactly the history of that fact. A current memory ("Caroline works at Meridian") has a parent ("Caroline joined Meridian after leaving her previous role") which has its own parent ("Caroline was considering a career change"). Walking up the ancestor chain for any retrieved memory produces the temporal context for that fact automatically, without additional queries. The +0.595 temporal gain over mem0 (0.7321 vs 0.1371) reflects this: the tree topology encodes time as structure rather than as a metadata field to be reasoned about. Temporal is the single largest category win in the paper.
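The ancestor walk described above can be sketched in a few lines. This is an illustrative sketch only: `MemoryNode` and `ancestor_chain` are hypothetical names, and the sketch assumes each memory holds a single parent pointer. It does not reproduce the system's actual storage schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MemoryNode:
    """Hypothetical tree node: one stored fact plus a pointer to its parent."""
    text: str
    parent: Optional["MemoryNode"] = None


def ancestor_chain(node: MemoryNode) -> List[str]:
    """Return the history of a fact, nearest ancestor first."""
    chain = []
    current = node.parent
    while current is not None:
        chain.append(current.text)
        current = current.parent
    return chain


# The three memories from the example above, oldest first.
considering = MemoryNode("Caroline was considering a career change")
joined = MemoryNode(
    "Caroline joined Meridian after leaving her previous role", parent=considering
)
current = MemoryNode("Caroline works at Meridian", parent=joined)

for text in ancestor_chain(current):
    print(text)
```

The walk needs no extra similarity queries: once the current memory is retrieved, its history comes from following parent pointers.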

Why the entity heatmap helps multi-hop reasoning

Multi-hop questions require connecting facts about two entities: "Where was [person A] living when [person B] got married?" The answer requires retrieving a fact about person A (their location) and a fact about person B (their marriage date) and reasoning across them. The difficulty is that a query about person A may not cosine-rank memories about person B highly, so flat retrieval tends to over-represent one entity and miss the other.

The entity heatmap addresses this with intersection-count ordering: memories tagged with both person A and person B rank above memories tagged with only one. These intersection memories are precisely the cross-entity facts that multi-hop questions require — they represent moments when both entities are mentioned together in the source conversation, which is almost always what the question is probing. The +0.156 multi-hop gain over mem0 (0.6979 vs 0.5417) and the collapse to ~0.29 when tree expansion is removed (flat retrieval, same Cohere reranker) both confirm that this structural signal, not reranking quality, is what enables multi-hop reasoning.
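Intersection-count ordering can be illustrated with a small sketch. It assumes each memory is a dict with an `entities` list, and the function name `heatmap_order` is hypothetical. The production heatmap also applies caps and other signals that are not modeled here.

```python
from typing import Dict, List


def heatmap_order(memories: List[Dict], query_entities: List[str]) -> List[Dict]:
    """Order memories by how many of the query's entities they are tagged with.

    Memories matching no query entity are dropped. Ties keep their input
    order because Python's sort is stable.
    """
    wanted = set(query_entities)
    scored = [(len(wanted & set(m["entities"])), m) for m in memories]
    scored = [(count, m) for count, m in scored if count > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored]


memories = [
    {"id": "m1", "entities": ["person A"]},
    {"id": "m2", "entities": ["person A", "person B"]},
    {"id": "m3", "entities": ["person B"]},
    {"id": "m4", "entities": ["person C"]},
]

ordered = heatmap_order(memories, ["person A", "person B"])
print([m["id"] for m in ordered])  # ['m2', 'm1', 'm3']
```

The memory tagged with both entities ranks first regardless of its cosine rank.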

The entity-hyperedge expansion introduced in §3 extends this further: after the initial vector search, the system explicitly walks the entity co-reference graph to surface memories sharing 2+ named entity references with the top-K results. This adds intersection-cluster memories that the cosine pass missed but that are structurally connected through shared entities. (An earlier version reported that this expansion plus Claude Sonnet lifted multi-hop to 0.7917; that figure came from the superseded run and does not reproduce — see §4.2, Finding 8. The reproducible multi-hop figure is 0.6979.) The threshold of 2+ shared entities is load-bearing: single-entity neighbors add noise that regresses multi-hop; intersection-only neighbors add signal.
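A minimal sketch of the 2+ shared-entity threshold follows. It assumes that a candidate qualifies when it shares at least `min_shared` entities with any single top-K result; the paper does not specify whether sharing is counted per result or against the union, so this is one plausible reading. The name `hyperedge_expand` and the dict layout are illustrative.

```python
from typing import Dict, List


def hyperedge_expand(
    top_k: List[Dict], corpus: List[Dict], min_shared: int = 2
) -> List[Dict]:
    """Return corpus memories sharing at least `min_shared` entities with a seed.

    Assumption: sharing is counted against each seed memory individually.
    """
    seed_ids = {m["id"] for m in top_k}
    added = []
    for candidate in corpus:
        if candidate["id"] in seed_ids:
            continue  # already retrieved by the vector search
        cand_entities = set(candidate["entities"])
        if any(
            len(cand_entities & set(seed["entities"])) >= min_shared
            for seed in top_k
        ):
            added.append(candidate)
    return added


seed = {"id": "s1", "entities": ["Caroline", "Meridian", "Austin"]}
corpus = [
    seed,
    {"id": "c1", "entities": ["Caroline", "Meridian"]},
    {"id": "c2", "entities": ["Caroline"]},
    {"id": "c3", "entities": ["Austin", "Meridian", "Dana"]},
]

print([m["id"] for m in hyperedge_expand([seed], corpus)])  # ['c1', 'c3']
print([m["id"] for m in hyperedge_expand([seed], corpus, min_shared=1)])  # ['c1', 'c2', 'c3']
```

Lowering the threshold to 1 admits `c2`, the single-entity neighbor of the kind described above as noise.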

The wide-K finding and its implications

Expanding the candidate pool from 50 to 200 before reranking is the largest single-step improvement in our ablation, on par with or larger than the tree structure itself. (The originally reported +0.123 magnitude for this step is superseded and pending re-measurement; see §4.2. The direction is robust.) This follows directly from the coverage arithmetic: at K=50 with ~1,300 memories per agent, the initial cosine pass samples 3.8% of the corpus. At K=200, coverage rises to 15.4%. For agents with longer memory histories, this gap will only grow.
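The coverage arithmetic can be checked directly. The sketch below uses the approximate corpus size of 1,300 from the text; the 5,000-memory row is a hypothetical longer history added only to show the trend, not a measured configuration.

```python
def coverage(k: int, corpus_size: int) -> float:
    """Fraction of the corpus seen by the initial top-K pass."""
    return k / corpus_size


# 1,300 is the approximate per-agent corpus size from the text;
# 5,000 is a hypothetical longer history.
for corpus_size in (1300, 5000):
    for k in (50, 200):
        print(f"N={corpus_size}, K={k}: {coverage(k, corpus_size):.1%}")
```

At N=1,300 this gives 3.8% and 15.4%. At the hypothetical N=5,000 the same K values cover 1.0% and 4.0%, so a fixed K covers a shrinking share of the corpus as history grows.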

The practical implication is that retrieval width is a first-class architectural concern, not a hyperparameter to tune late. Systems that design around K=50 for cost reasons are implicitly accepting a hard ceiling on recall. Cohere reranking is only as good as the candidates it receives; a cross-encoder cannot recover a fact that was never retrieved.
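The recall ceiling can be shown with a toy two-stage pipeline. Everything here is synthetic: the scores are invented, `retrieve_then_rerank` is a hypothetical name, and the second-stage scores stand in for a cross-encoder without calling any real reranker.

```python
from typing import Dict, List


def retrieve_then_rerank(
    first_stage: Dict[int, float],
    rerank: Dict[int, float],
    k: int,
    final_n: int,
) -> List[int]:
    """Take the top-k ids by first-stage score, then reorder them by rerank score."""
    candidates = sorted(first_stage, key=first_stage.get, reverse=True)[:k]
    return sorted(candidates, key=rerank.get, reverse=True)[:final_n]


# Synthetic corpus of 1,300 memories. The first stage ranks id 0 highest
# and id 1299 lowest, so the target (id 120) sits at first-stage rank 121.
first_stage = {i: float(1300 - i) for i in range(1300)}
# The stand-in reranker considers only the target relevant.
rerank = {i: 0.0 for i in range(1300)}
target = 120
rerank[target] = 1.0

narrow = retrieve_then_rerank(first_stage, rerank, k=50, final_n=50)
wide = retrieve_then_rerank(first_stage, rerank, k=200, final_n=50)

print(target in narrow)  # False
print(wide[0] == target)  # True
```

With K=50 the target never reaches the reranker, so no second-stage score can surface it. With K=200 the same reranker places it first.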

The corollary, that the reasoning LLM is a marginal contributor at equal retrieval width, challenges the common assumption that upgrading to a more capable model is the primary lever for improving memory-augmented agent performance. (The specific per-step model deltas originally reported, +0.008 at K=50 and +0.023 at wide-K, are not re-verified or depend on a superseded run; the qualitative conclusion holds.) Our results suggest the opposite: for the majority of answerable questions, the bottleneck is whether the right memory is in the context window, not whether the model can reason over it once it arrives.

Open-domain: the remaining gap

Our system underperforms mem0 on open-domain questions (0.7229 vs 0.8109, −0.088), the one category where the architecture trails. Open-domain questions in LOCOMO are broad aggregation queries ("What are [person]'s interests?", "How would you describe [person]'s personality?") that benefit from high recall across all memory types rather than high precision on a specific fact.

We diagnose this as a query-shape mismatch rather than an architectural failure. Our retrieval pipeline is optimized for precision: K=50 candidates, diversity filter, entity heatmap with fixed caps. This setup is effective for temporal and multi-hop questions, where the target memory is specific and the risk of irrelevant candidates is high. For open-domain aggregation queries, the same setup starves the model of breadth.

A natural fix follows from this diagnosis: adaptive K, a query-type classifier that scales retrieval width by query breadth, routing open-domain questions to a wider (K×6) search path before reranking. We previously reported that this lifts open-domain past mem0 (a reported 0.8216 vs 0.8109); that result was measured only in the superseded scoring run and does not reproduce. In the corrected configuration open-domain stands at 0.7229, below mem0. Whether adaptive K actually recovers the gap is an open empirical question pending re-measurement; the diagnosis (a precision-tuned pipeline starves breadth-seeking queries) remains our working hypothesis.
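The routing idea can be sketched as follows. The keyword classifier is a deliberately crude stand-in, since the paper does not describe how its query-type classifier works; `BROAD_CUES`, `classify_query`, and `adaptive_k` are hypothetical names. Only the base K of 50 and the ×6 widening factor come from the text.

```python
# Hypothetical cue words that mark a broad aggregation query.
BROAD_CUES = ("hobbies", "interests", "personality", "describe")


def classify_query(question: str) -> str:
    """Stand-in classifier: label a query 'open-domain' if it contains a broad cue."""
    q = question.lower()
    return "open-domain" if any(cue in q for cue in BROAD_CUES) else "specific"


def adaptive_k(question: str, base_k: int = 50, widen_factor: int = 6) -> int:
    """Widen the candidate pool for open-domain queries; keep base K otherwise."""
    if classify_query(question) == "open-domain":
        return base_k * widen_factor
    return base_k


print(adaptive_k("What are Caroline's hobbies?"))  # 300
print(adaptive_k("What was her job before she moved?"))  # 50
```

Whether such routing recovers the open-domain gap is exactly the empirical question left open above; the sketch shows only the control flow.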

We read the open-domain gap as a query-shape mismatch, not as evidence that the temporal tree is fundamentally ill-suited to broad retrieval: a precision-tuned pipeline can in principle be widened for aggregation queries without abandoning the structure that wins on temporal and multi-hop. Demonstrating that widening closes the gap under corrected scoring is future work.

6. Future Work

Five directions follow naturally from the findings and limitations identified in this paper.

Branch summary quality depends on tree quality. Fork-to-fork narrative summaries are implemented and live as a retrieval navigation aid, but the current experiment tree was built with a loose parent assignment threshold (0.60 cosine) that produced hub-heavy topology. Summaries generated on a cleaner tree (0.85 threshold) are expected to be more coherent and provide stronger branch navigation signal. Re-ingesting with the corrected threshold and re-running the summary generation job is the immediate next experiment.

Persistent typed memory edges. The current entity-hyperedge expansion derives connections on the fly via entity co-reference in memory_entities. A persistent edge table with LLM-written edge descriptions — "these two memories contradict each other about Caroline's timeline" — would enable richer traversal, pruning, and explanation. Typed edges (contradiction, refinement, temporal succession, causal) between fork nodes and summary nodes would form an explicit hyperedge layer above the temporal tree. Edge descriptions are themselves first-class memories in the model — facts about the relationship between two other facts — enabling a self-describing memory graph where the structure and its meaning are stored in the same representation.

Embedding fine-tuning. The query-statement semantic gap (questions are phrased differently from the facts that answer them) is currently bridged by wide retrieval (K=200) and cross-encoder reranking. A bi-encoder fine-tuned on (question, memory) pairs from LOCOMO or similar corpora could close this gap at the retrieval stage itself, reducing dependence on Cohere and enabling competitive performance at smaller K. The wide-K step from K=50 to K=200 (originally reported as +0.123; superseded and pending re-measurement, see §4.2) indicates how much room a better bi-encoder has to recover.

LLM-driven graph rewriting. The current deterministic maintenance pass (§3.3) handles structural hygiene without LLM calls. A second pass could apply LLM reasoning to consolidate a memory cluster into a single canonical sentence and generate bridging memories that connect nearby tree branches sharing implicit context. This is the long-horizon complement to the write-time tree structure: where the tree encodes relationships as they arrive, a rewriting pass repairs and enriches the structure as the corpus matures.

ONNX self-hosted reranking. Cohere's rerank-v3.5 is the current cross-encoder backend. Deploying a quantized cross-encoder (e.g., ms-marco-MiniLM-L6) locally via ONNX would remove the external API dependency for production deployments where latency and data residency matter. Our ablation shows the reranker identity contributes only +0.002 in isolation — the cost of switching backends is low.

7. Conclusion

We presented Temporal Tree Memory, a memory architecture for AI agents in which every stored fact is assigned a position in a tree at write time. The tree encodes time as topology (root to leaf), fact refinement as parent-child, and contradiction as forking branches. Retrieval becomes branch traversal: a semantic entry point, followed by structured expansion through tree descendants, entity co-reference neighbors, and fork-to-fork narrative summaries. A complementary entity heatmap ensures that memories connecting multiple named entities — the load-bearing signal for multi-hop reasoning — are always represented in context regardless of their cosine rank.

On LOCOMO, our system achieves an LLM judge score of 0.7253 (Claude Sonnet, wide retrieval), compared to 0.6383 for mem0 on the same harness and the same judge (GPT-4.1-mini) — an improvement of +0.087 overall. The gains are concentrated exactly where the architecture is designed to help: temporal reasoning (+0.595 over mem0) and multi-hop reasoning (+0.156). Single-hop is at parity (+0.007), and open-domain remains the one category where mem0 leads (−0.088). A systematic ablation identifies a clear contribution hierarchy: memory structure first, retrieval coverage second, reranker identity near-zero, reasoning model marginal.

Three findings from the ablation are practically important beyond this paper. First, the choice of cross-encoder reranker contributes ~+0.002 in isolation; it does not drive performance. The dominant variable is whether the right memory is in the candidate pool, not how candidates are reranked once retrieved. Second, retrieval width is a first-class architectural concern: widening the candidate pool before reranking is one of the largest levers in the ablation (the originally reported +0.123 magnitude is pending re-measurement; see §4.2). Systems that fix K for cost reasons accept a hard recall ceiling. Third, the reasoning-model upgrade from GPT-4.1-mini to Claude Sonnet is marginal at equivalent retrieval width. For most memory-augmented agent applications, the bottleneck is retrieval coverage, not reasoning capability. Investing in memory structure and retrieval width outperforms investing in a more capable model.

Open-domain remains the one category where mem0 leads (−0.088). We hypothesize that matching retrieval width to query breadth (adaptive K) recovers this gap — an earlier run suggested as much, but that result does not reproduce, so the question is open. Precision-optimized retrieval is not inherently worse at broad queries; demonstrating that a query-type classifier closes the open-domain gap under corrected scoring is future work.

Temporal Tree Memory is available as an open system with a documented protocol surface, a Python SDK (pip install aivery, https://pypi.org/project/aivery/, source at https://github.com/aivery-systems/aivery-sdk), and a benchmark harness that enables reproducible comparison on LOCOMO and BEAM. We believe the core finding — that structure at write time, not post-hoc clustering, is what enables temporal and multi-hop reasoning — will generalize beyond this benchmark to any memory-augmented agent operating over long conversation histories.

References

[1] Chhikara, P., et al. (2025). Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv:2504.19413. https://arxiv.org/abs/2504.19413

[2] Packer, C., Wooders, S., Lin, K., Fang, V., Patil, S. G., Stoica, I., & Gonzalez, J. E. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560. https://arxiv.org/abs/2310.08560

[3] Rasmussen, P., et al. (2025). Zep: A Temporal Knowledge Graph Architecture for Agent Memory. arXiv:2501.13956. https://arxiv.org/abs/2501.13956

[4] LangChain. (2025). LangMem: Long-Term Memory for AI Agents. Documentation and SDK. https://langchain-ai.github.io/langmem/

[5] Maharana, A., Lee, D.-H., Tulyakov, S., Bansal, M., Barbieri, F., & Fang, Y. (2024). Evaluating Very Long-Term Conversational Memory of LLM Agents. arXiv:2402.17753. https://arxiv.org/abs/2402.17753

[6] Tavakoli, M., et al. (2025). Benchmarking and Enhancing Long-Term Memory in LLMs. arXiv:2510.27246. https://arxiv.org/abs/2510.27246

[7] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. https://arxiv.org/abs/2005.11401

[8] Khattab, O., & Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR 2020. https://arxiv.org/abs/2004.12832

[9] Sentence-Transformers. (n.d.). MS MARCO Cross-Encoders. https://www.sbert.net/docs/pretrained-models/ce-msmarco.html

[10] Cohere. (2024). Rerank v3.5 Documentation. https://docs.cohere.com/docs/rerank

[11] Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. ACL 2002. https://aclanthology.org/P02-1040/

[12] Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., & Zhu, C. (2023). G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. EMNLP 2023. https://arxiv.org/abs/2303.16634

[13] Zheng, L., Chiang, W.-L., Sheng, Y., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023. https://arxiv.org/abs/2306.05685