Temporal Tree Memory: Structured Organization Unlocks Multi-Hop Reasoning in AI Memory Systems
Aivery Systems
We present Temporal Tree Memory, a memory architecture for AI agents in which every stored fact is assigned a position in a tree at write time. The tree topology encodes temporal ordering (root to leaf), fact refinement (parent to child), and contradiction (forking branches), making retrieval a branch traversal problem rather than a flat nearest-neighbor search. A complementary entity heatmap layer surfaces cross-entity memories for multi-hop queries using intersection-count ordering over entity metadata.
On the LOCOMO conversational memory benchmark (1,540 questions across 10 multi-session conversations), our system achieves an LLM judge score of 0.8227 (Claude Sonnet, wide retrieval) and 0.8000 (gpt-4.1-mini), compared to 0.6383 for mem0 on the same evaluation harness. Gains over mem0 are concentrated in temporal reasoning (+0.486) and multi-hop reasoning (+0.135 at K=50; up to +0.250 with targeted category-level enhancements described in §4.2). A systematic ablation isolates four contributions: tree structure and entity heatmap (+0.12), retrieval candidate width (+0.12), reranker identity (+0.002), and reasoning model (+0.008–0.023). The dominant finding is that memory structure and retrieval coverage account for nearly all performance improvement; the choice of reranker and reasoning model is secondary. In our classifier-labeled failure analysis, extraction failures account for under 2% of errors, and retrieval failures drop from 71% to 63% when candidate width is increased and structural expansion is applied.
1. Introduction
An AI agent that can remember is more useful than one that cannot. This is the straightforward observation behind a growing class of memory systems for LLM agents — systems that store facts from conversations, retrieve them at query time, and inject them into the model's context. The architecture of these systems has converged on a common pattern: store memories as flat vector embeddings, retrieve by cosine similarity, and optionally apply post-hoc clustering to add some structure. Systems such as mem0, MemGPT, Zep, and LangMem explore different points in this design space: persistent fact extraction, virtual context management, temporal knowledge graphs, and reflection-based long-term memory tooling [1–4].
The problem with this pattern is that it treats memory as a storage and retrieval problem when it is fundamentally a structure problem. Human memory is not a flat database — it is organized around time, causality, and entity relationships. The fact that "Caroline got a new job" is understood in the context of what her previous job was, who she told first, and what else was happening in her life at that time. Flat retrieval can return the fact; it cannot return the context. And without the context, temporal reasoning ("what was true before X?"), multi-hop reasoning ("how did event A lead to event B?"), and contradiction detection ("what did she believe before she changed her mind?") all fail.
We propose Temporal Tree Memory: a memory architecture in which every fact is placed into a tree at write time, with the tree topology encoding time (root to leaf), refinement (parent to child), and contradiction (forking branches). Retrieval becomes branch traversal: find the right entry point via semantic search, then walk the tree to recover the history and refinements of each retrieved fact. A secondary entity heatmap layer ensures that memories connecting multiple named entities are surfaced for multi-hop queries, using intersection-count ordering rather than cosine similarity to rank cross-entity facts.
On the LOCOMO conversational memory benchmark [5], our system achieves an LLM judge score of 0.6773 (gpt-4.1-mini) and 0.8227 (Claude Sonnet-4.6 with wide retrieval), compared to 0.6383 for mem0 on the same harness. The gains are concentrated exactly where the architecture is designed to help: temporal reasoning (+0.486 over mem0) and multi-hop reasoning (+0.135 at K=50; rising to +0.250 with targeted category-level enhancements, §4.2). A systematic ablation isolates four independent contributions: tree structure (+0.12), retrieval width (+0.12), reranker identity (+0.002), and reasoning model (+0.008–0.023). The reranker identity and reasoning model contributions are dwarfed by structure and retrieval coverage — the practical implication is that memory architecture and retrieval breadth dominate the choice of reranker or reasoning model for most conversational memory applications.
The remainder of the paper is organized as follows. Section 2 surveys related work. Section 3 describes the four architectural components. Section 4 presents the evaluation on LOCOMO including the full ablation and comparison to mem0. Section 5 discusses the mechanisms behind the key findings and the open-domain regression. Section 6 describes future directions.
3. Architecture
The core design principle of Temporal Tree Memory is that structure should be primary, not post-hoc. Existing memory systems store facts as flat records and apply clustering periodically as a retrieval optimization. We invert this: every memory is placed into a tree at write time, and the tree topology is the clustering. Related memories are parent and child, time flows root-to-leaf, and contradictions fork.
This section describes the four components that realize this principle: the temporal tree (§3.1), the entity heatmap (§3.2), the async validation pipeline that populates the tree without blocking writes (§3.3), and the reranking layer that operates downstream of retrieval (§3.4).
3.1 Temporal Tree Structure
The foundational change is a single schema addition: a parent_id self-referential foreign key on the memories table. A memory with parent_id = null is a topic root — an entry point into a subject domain. A memory with a non-null parent_id is a child: a refinement, update, elaboration, or contradiction of its parent. The tree is built incrementally as memories are written, so at any point in time the topology reflects the full history of what an agent has learned about each topic.
Tree traversal
Four recursive CTE queries are the retrieval primitives:
GetAncestorsAsync(id, maxDepth=10)— walks up toward the root; returns the history leading to any factGetDescendantsAsync(rootId, maxDepth=5)— walks down; returns all refinements of a conceptGetTreeContextAsync(seedIds, maxDepth=2)— breadth-first expansion from multiple entry points simultaneously; used in retrieval to expand top-K semantic results into their surrounding subtreesGetTemporalNeighborsAsync(id, window=2)— returns the 2 memories before and 2 after a given node bycreated_atwithin the same agent; used only whenIsTemporalQuerydetects a time-oriented question
Context expansion
The /memory/context endpoint takes the top semantic results and expands each into a subtree: ancestors up to depth 2, children to depth 1. This produces a context that is not a list of independent facts but a structured neighborhood — the topic lineage above a memory and its refinements below. The expansion is always active; temporal sequential expansion (conversation-order neighbors) is gated on IsTemporalQuery to prevent multi-hop noise.
Parent assignment
When a new memory is validated, the validator derives a SuggestedParentId from the conflicts it detects and the nearest-neighbor similarity of the new memory to existing ones. The priority chain is: refinement relationship > update/contradiction relationship > cosine similarity above threshold > null (new root). Parent assignment uses a two-threshold design: a tight provisional threshold (0.85) applied at write time, and a relaxed threshold (0.60) used by the background validator when a semantic relationship is detected. The write-time threshold is deliberately tight to prevent early or general memories from accumulating many children and becoming hubs; the validator's lower threshold allows related-but-distinct memories (cosine 0.62–0.68) that would otherwise become orphan roots to be correctly placed as tree siblings once the semantic relationship is confirmed. A per-node child cap (12 children) prevents any single node from becoming a hub regardless of similarity scores.
Tree expansion in retrieval
The RetrieveAsync path includes a tree expansion pass after the initial vector+lexical candidate pool is assembled. The top-5 scoring candidates are used as entry points; their subtrees (up to depth 2) are traversed via GetTreeMemberIdsAsync, and any uncovered descendant IDs are added to the candidate pool with a discounted score (average pool score × 0.75). This surfaces closely related memories that the embedding search missed — refinements and updates that are semantically nearby but not nearest-neighbor matches — while preventing expansion candidates from displacing strong direct matches.
Recency-aware retrieval
Queries asking about the current state of a fact ("What is [person]'s job now?", "Where does [person] live?") require preferring the most recent version of a memory over the highest-cosine match. A regex classifier detects current-state query intent and, when triggered, applies a recency weight override (0.35, versus the base 0.03) in the scoring function. The 0.35 value was selected empirically: inspection of retrieval failure cases showed that a weight below ~0.20 was insufficient to overcome cosine dominance for near-duplicate memories, while weights above ~0.40 caused recency to overwhelm semantic relevance entirely. We do not report classifier accuracy on a held-out set; this is a known limitation and future work should validate the regex against an annotated query sample.
3.2 Entity Heatmap
The entity heatmap is a second retrieval layer that runs in parallel with cosine-based semantic retrieval. Its purpose is to ensure that memories about named entities mentioned in the query are represented in the context, even if their cosine similarity to the query is below the top-K threshold.
Mechanism
The system maintains a per-organization cache of known entity names (5-minute TTL). When a context request arrives, the query is matched against this cache using word-boundary string matching to extract mentioned entities. Two pools of entity memories are then fetched via GetByNamedEntitiesAsync:
- Hot memories (cap 25): memories in parent groups already activated by the semantic retrieval results. These are entities that the cosine search already found relevant; the heatmap adds depth within those branches.
- Cold memories (cap 5): memories tagged with queried entities but not yet in any activated branch. These provide entity coverage that cosine similarity missed.
Intersection ordering
GetByNamedEntitiesAsync sorts results by intersection count — memories tagged with multiple queried entities rank first. This is the load-bearing mechanism for multi-hop reasoning: a question that asks about the relationship between two entities (e.g., "Where did [person] work when [event] happened?") will surface memories that connect both entities near the top of the entity pool, before memories that relate to only one.
This ordering is what Cohere reranking cannot replicate on its own: the intersection-count sort operates on structured entity metadata, not on semantic similarity. As our ablation shows (§4.2), disabling this in favor of pure vector search causes multi-hop to regress by 0.02 even when the total candidate count is unchanged.
Relationship to lexical retrieval
The entity heatmap is superficially similar to BM25 — both surface memories relevant to query keywords — but they differ in three ways that matter for multi-hop reasoning. First, BM25 scores by lexical overlap between query tokens and memory text; the heatmap operates on structured entity metadata extracted at write time by the enrichment pipeline. A memory recorded as "she secured a position at Meridian" scores zero in BM25 for the query "Caroline's job" because neither token appears in the text, but the heatmap finds it because the enrichment pipeline tagged it with the entity Caroline. Second, BM25 has no privileged notion of multi-entity intersection — it scores by combined term frequency across all query tokens. The heatmap's intersection-count ordering explicitly ranks memories tagged with multiple queried entities above memories tagged with only one, which is the structural signal that enables cross-entity retrieval. Third, the hot/cold pool split distinguishes memories already activated by semantic search (where the heatmap adds depth) from those missed by cosine similarity entirely (where it adds coverage). BM25 treats all candidates identically regardless of what the vector stage found.
3.3 Async Validation and Maintenance Pipeline
The validation pipeline is responsible for two things: assigning parent IDs (tree placement) and rejecting duplicates. In earlier versions, this ran synchronously on the write path, adding one LLM call per memory. The async pipeline moves this off the hot path entirely.
Write flow
When a memory is written, the system computes the embedding and stores the record with validation_status = pending_validation, then immediately returns. Two jobs are enqueued on a Postgres-backed job queue: a high-priority validation job and a medium-priority enrichment job.
Background worker
BackgroundJobWorkerService polls the background_jobs table every 2 seconds using FOR UPDATE SKIP LOCKED for safe concurrent dequeue. The validation handler calls the LLM classifier, which assigns one of six relationship labels (duplicate | contradiction | update | refinement | uncertain | no_relation) against the nearest existing memories. If the result is duplicate, the pending memory is marked stale. Otherwise, parent_id is set according to the priority chain described in §3.1. The enrichment handler extracts named entities and atomic facts and writes them to memory_entities and memory_facts.
Production benefit
Writes return in under 200ms. LLM validation, which can take 1–3 seconds, runs in the background. The tree structure converges within seconds of a write completing.
Deterministic graph maintenance
A separate background pass — run on demand after a period of agent inactivity — applies deterministic graph hygiene: marking stale memories, merging duplicates, cleaning orphaned edges, and refreshing cluster labels. This pass uses no LLM calls and is idempotent; it can be re-run after any batch ingestion to consolidate the tree structure.
3.4 Reranking Layer
Cosine similarity between a query embedding and a memory embedding has a fundamental limitation for conversational memory retrieval: questions are phrased as queries ("What job did she get?") and memories are stored as statements ("Caroline secured employment at a tech startup in the fall"). A bi-encoder assigns these similar but not identical representations, and at K=50 with thousands of memories per agent, the coverage is only ~3–5%.
A cross-encoder reads the query and each candidate memory jointly, allowing it to bridge this query-statement gap directly. We implement this as a pluggable IRerankingService with a Cohere rerank-v3.5 backend (and a no-op fallback for self-hosted deployments).
Integration
Reranking operates on the full candidate pool assembled by the semantic retrieval and entity heatmap — after tree expansion, before context is returned to the caller. This means the reranker sees the richer, tree-expanded candidate set rather than the raw cosine-only results. Cohere receives the query and a list of candidate memory strings and returns a relevance-ranked ordering; the API returns the top limit memories from that ordering.
Wide retrieval
To give the reranker more candidates to choose from, the API internally retrieves limit × RetrieveMultiplier candidates from cosine + entity heatmap before reranking to limit. With RetrieveMultiplier = 4 and limit = 50, the cosine stage retrieves 200 candidates and Cohere selects the top 50. This decouples the initial retrieval width from the final context size.
4. Evaluation
4.1 Benchmark Setup
We evaluate on LOCOMO [5], a conversational memory benchmark consisting of 10 multi-session conversations between two speakers. Each conversation spans multiple sessions over simulated time and is accompanied by a question set covering four categories: single-hop (direct factual recall, 282 questions), temporal (time-relative reasoning, 321 questions), multi-hop (connecting facts across two entities or events, 96 questions), and open-domain (broad knowledge about the participants, 841 questions). Category 5 (unanswerable questions, 446 items) is excluded from all runs, yielding 1,540 evaluated questions.
Ingestion. Each conversation is ingested as a single agent (conversation_0 through conversation_9), treating both speakers as a shared memory space. This matches the production model where a single agent accumulates memories from both sides of a conversation. Memories are extracted using GPT-4.1-mini and stored via the async write pipeline described in §3.3. Full LLM validation and deduplication produces approximately 1,200–1,400 memories per conversation after the background jobs complete.
Retrieval and answering. At evaluation time, each question is answered by passing it to the Cortex agent, which calls /memory/context with limit=50 and RetrieveMultiplier=4 (200 candidates retrieved, Cohere reranks to 50). The agent answers using the returned context. A 0.5-second delay between questions prevents rate limiting.
Evaluation. Answers are evaluated using three metrics: BLEU, token F1, and an LLM judge score. The LLM judge (GPT-4.1-mini) receives the question, the gold answer, and the system's answer and scores correctness on a 0–1 scale. We report LLM judge score as the primary metric (see §4.5).
Failure classification. For incorrect answers, we run a secondary LLM classifier that determines the failure mode: retrieval_failure (the correct answer was not in the top-K retrieved memories), reasoning_failure (the correct answer was present in context but the model failed to use it), or extraction_failure (the fact exists in the database but was never stored correctly during ingestion).
4.2 Ablation Grid
| Configuration | LLM Score | Delta |
|---|---|---|
| Flat retrieval, ms-marco reranker | 0.5552 | — |
| Flat retrieval, Cohere K=50 | 0.5571 | +0.002 vs ms-marco |
| Tree + entity heatmap, ms-marco | 0.6669 | +0.112 vs flat ms-marco |
| Tree + entity heatmap, Cohere K=50, gpt-4.1-mini | 0.6773 | +0.122 vs flat Cohere |
| Tree + entity heatmap, Cohere wide-K (200→50), gpt-4.1-mini | 0.8000 | +0.123 vs K=50 |
| Tree + entity heatmap, Cohere K=50, Claude Sonnet | 0.6851 | +0.008 vs gpt-4.1-mini at same retrieval width |
| Tree + entity heatmap, Cohere wide-K, Claude Sonnet | 0.8227 | +0.023 vs wide-K+mini |
| Rows below run on CMC-pruned experiment corpus — not directly comparable to rows above | ||
| ++ tree expansion in retrieval + recency mode, wide-K, Claude Sonnet | 0.8148 | −0.008 vs previous; reasoning failures −37 cases (see §4.4) |
| ++ adaptive K + entity hyperedge, wide-K, gpt-4.1-mini | 0.8052 | +0.005 vs wide-K baseline; open-domain 0.8216 exceeds mem0 0.8109 |
| ++ full feature stack (adaptive K + entity + recency + tree), Claude Sonnet | 0.8201 | +0.005 vs experiment baseline; multi-hop 0.7917 (+0.145 over prior best) |
Finding 1: Reranker choice (Cohere vs ms-marco) on flat retrieval: +0.002 — noise. The reranker does not drive performance; the candidate pool does.
Finding 2: Tree + entity heatmap adds +0.12 on top of both rerankers. Structural organization is the primary architectural contribution.
Finding 3: Wide retrieval (K=200 candidates → Cohere selects 50) adds +0.123 over K=50. Retrieval coverage was the primary remaining bottleneck — Cohere with 200 candidates is dramatically more effective than Cohere with 50.
Finding 4: LLM upgrade (gpt-4.1-mini → Claude Sonnet) adds +0.008 at K=50 and +0.023 at wide-K. At equal retrieval width the LLM contribution is noise-level. Memory retrieval quality dominates answer quality; practitioners do not need a premium reasoning model to achieve strong results — they need better retrieval coverage.
Finding 5: Multi-hop — flat + Cohere: 0.29; tree + heatmap + Cohere: 0.68. Intersection-count ordering in the entity heatmap, not the reranker, enables cross-entity multi-hop reasoning.
Finding 6: Adding tree expansion in retrieval and recency-aware scoring produces a small overall regression (−0.008) but a large drop in reasoning failures (136 → 99, −27%); see §4.4. The two features improve context quality — the right version of a fact reaches the model more reliably — without closing the hard retrieval gaps. The −0.008 overall delta is within the noise band of LLM judge stochasticity and is also partly attributable to different corpus states between runs (CMC-pruned experiment branch vs. full main branch). We treat the reasoning failure reduction as the primary signal from this configuration.
Finding 7: Adaptive retrieval width (K×2 for narrow queries, K×6 for broad/aggregation queries) combined with entity-hyperedge expansion (memory neighbors via shared entity references) yields +0.005 overall and closes the open-domain regression: open-domain reaches 0.8216, exceeding mem0's 0.8109. Open-domain questions are structurally broad aggregation queries; routing them to the K×6 search path provides the high-recall retrieval they require without penalizing precision-oriented categories.
Finding 8: The full feature stack (adaptive K + entity-hyperedge expansion + recency-aware scoring + tree expansion in retrieval) with Claude Sonnet achieves 0.8201 overall and multi-hop 0.7917 — a +0.145 gain over the prior best multi-hop of 0.6771. This configuration changes both the retrieval features and the reasoning model simultaneously; we do not have a model-held-out isolation run to attribute the multi-hop gain independently between entity-hyperedge expansion and Claude Sonnet's reasoning. Note that 0.8201 is slightly below the simpler wide-K + Claude Sonnet configuration (0.8227); the full stack trades a small amount of headline score for category-level gains in multi-hop (+0.145) and improved retrieval failure rates (179 → 174).
The ablation isolates four independent variables: memory structure (flat vs tree+heatmap), reranker identity (ms-marco vs Cohere), retrieval width (K=50 vs K=200→50), and reasoning model (gpt-4.1-mini vs Claude Sonnet). The table above reveals a clear contribution hierarchy: structure first, retrieval width second, reranker identity near-zero, reasoning model marginal.
Note on architectural consistency. The first five rows (through "Tree + entity heatmap, Cohere wide-K, Claude Sonnet") use the /memory/context endpoint's tree expansion. The final three rows additionally enable tree expansion within RetrieveAsync, entity-hyperedge expansion, recency-aware scoring, and adaptive K in the retrieval pipeline. These are additive features, not replacements; the early rows are not retroactively affected. Rows sharing the same retrieval pipeline are directly comparable; the last three rows form a separate configuration family measured against the wide-K baselines.
The +0.123 gain from wide retrieval is the largest single step in the table and reflects the core coverage problem in dense memory retrieval: at K=50 with ~1,300 memories per agent, the initial cosine pass covers only 3–4% of the corpus. Expanding to K=200 before reranking quadruples the search window and allows Cohere to select the best 50 from a far richer candidate set. This is retrieval coverage as architecture, not engineering.
4.3 Comparison vs mem0
| Category | mem0 | Ours | Delta |
|---|---|---|---|
| single-hop | 0.7270 | 0.7270 | 0.000 (tied) |
| temporal | 0.1371 | 0.6231 | +0.486 |
| multi-hop | 0.5417 | 0.6771 | +0.135 (full stack: 0.7917, +0.250) |
| open-domain | 0.8109 | 0.6813 | −0.130 |
| overall | 0.6383 | 0.6773 | +0.039 |
We compare against mem0 [1] using an identical evaluation harness. Both systems ingest the same 10 LOCOMO conversations; both are queried with the same questions; both answers are scored by the same LLM judge (GPT-4.1-mini). mem0's LLM judge score of 0.6383 is measured on our harness (results/mem0_eval.json), not taken from the published paper, ensuring the comparison is on identical infrastructure. The overall improvement of +0.039 is driven primarily by temporal (+0.486) and multi-hop (+0.135) categories, where the temporal tree and entity heatmap provide structural advantages that mem0's flat retrieval cannot replicate.
The open-domain gap (−0.130 at K=50) closes with wider retrieval: with adaptive K broad mode (K×6 candidates for aggregation queries) and entity-hyperedge expansion, open-domain reaches 0.8216, exceeding mem0's 0.8109. The K=50 comparison above is maintained for fairness — mem0 was evaluated at K=50, and the structural gains (temporal +0.486, multi-hop +0.135) hold at that width. Open-domain questions in LOCOMO tend to be broad ("What are [person]'s hobbies?") and benefit from high recall; they are exactly the query shape that adaptive K routes to wider retrieval.
The single-hop tie (both systems answer 205/282 correctly) is coincidental but illustrative: on simple direct-recall questions, retrieval structure adds little because the target memory is almost always in the top-K cosine results regardless of expansion strategy.
Metric provenance note. The 0.6383 figure is mem0's LLM judge score on our evaluation harness, not the F1 score reported in the mem0 paper. The two metrics are not interchangeable (see §4.5). All comparisons in this paper use LLM judge scores measured on the same harness.
4.4 Failure Analysis
We run the failure classifier on all incorrect answers from our best configuration (full feature stack: adaptive K + entity-hyperedge + recency mode + tree expansion, Cohere wide-K, Claude Sonnet; 277 wrong answers, 276 classified):
| Failure mode | Count | Percentage |
|---|---|---|
| Retrieval failure | 174 | 63.0% |
| Reasoning failure | 99 | 35.9% |
| Extraction failure | 3 | 1.1% |
Three findings stand out. First, retrieval failure is the dominant mode at 63.0% — more concentrated than in earlier configurations (56% without tree expansion + recency mode, 71% without wide retrieval). As reasoning failures are addressed through better context quality, retrieval becomes the cleaner bottleneck: the remaining errors are cases where the correct memory is not among the top 200 cosine candidates. Closing this gap requires a larger candidate pool or a better initial embedding, not better context handling.
Second, reasoning failures have held at 99 across the last two Claude Sonnet runs despite adding entity-hyperedge and adaptive K. This stability indicates those features improved retrieval (174 vs 179 retrieval failures) without degrading context quality — the additional candidates surfaced by entity expansion are genuinely useful rather than noisy. The 99 reasoning failures represent questions where the correct memory reached the context but the model failed to use it correctly.
Third, extraction failures are 3 (1.1%) — within the noise band of the LLM judge and classifier. Extraction quality is not a limiting factor for this corpus and question set.
4.5 Metric Justification
We report LLM judge score as the primary metric rather than BLEU or token F1. BLEU is a reference-overlap metric originally developed for machine translation [11], while LLM-as-a-judge methods are increasingly used for open-ended generation evaluation, albeit with known bias and reliability concerns [12,13]. This choice requires justification because F1 is the standard in QA benchmarks and the mem0 paper reports F1 as its primary metric.
Token F1 measures lexical overlap between a system answer and a gold reference answer. For systems that return short, extractive answers ("tech startup in Austin"), F1 is a reasonable proxy for correctness. For conversational memory systems like ours, where the agent generates a full natural-language answer ("Caroline works at a tech startup she joined in the fall of that year"), F1 will penalize the system even when the answer is semantically correct. This is a measurement artifact, not a capability difference.
We demonstrate this concretely: when we force our system to answer in five words or fewer (matching the format of extractive systems), F1 rises from 0.1265 to 0.2859 — but LLM judge score drops from 0.6773 to 0.5240. The format change makes our system look better on F1 and worse on the metric that actually reflects answer quality.
The LLM judge, which evaluates whether the answer is factually correct and responsive to the question regardless of phrasing, is the appropriate primary metric for conversational memory systems. We report BLEU and F1 for completeness and to allow comparison with papers that use those metrics.
5. Discussion
Why the tree helps temporal reasoning
Temporal questions in LOCOMO follow a consistent pattern: they ask about a state that was true before or after some reference event ("What was her job before she moved?", "Did they stay in touch after that?"). Answering these questions correctly requires knowing not just the current state of a fact but its history — the sequence of updates that led to the current state.
In a flat memory system, each version of a fact is a separate record with no structural relationship to its predecessors. Retrieving the current state is easy; retrieving the prior state requires inferring from timestamps alone, which is brittle when memories are densely sampled or when the timeline is ambiguous.
In our temporal tree, the ancestor chain of any memory is exactly the history of that fact. A current memory ("Caroline works at Meridian") has a parent ("Caroline joined Meridian after leaving her previous role") which has its own parent ("Caroline was considering a career change"). Walking up the ancestor chain for any retrieved memory produces the temporal context for that fact automatically, without additional queries. The +0.486 temporal gain over mem0 reflects this: the tree topology encodes time as structure rather than as a metadata field to be reasoned about.
Why the entity heatmap helps multi-hop reasoning
Multi-hop questions require connecting facts about two entities: "Where was [person A] living when [person B] got married?" The answer requires retrieving a fact about person A (their location) and a fact about person B (their marriage date) and reasoning across them. The difficulty is that a query about person A may not cosine-rank memories about person B highly, so flat retrieval tends to over-represent one entity and miss the other.
The entity heatmap addresses this with intersection-count ordering: memories tagged with both person A and person B rank above memories tagged with only one. These intersection memories are precisely the cross-entity facts that multi-hop questions require — they represent moments when both entities are mentioned together in the source conversation, which is almost always what the question is probing. The +0.135 multi-hop gain over mem0 and the collapse to 0.29 when tree expansion is removed (flat retrieval, same Cohere reranker) both confirm that this structural signal, not reranking quality, is what enables multi-hop reasoning.
The entity-hyperedge expansion extends this further: after the initial vector search, the system explicitly walks the entity co-reference graph to surface memories sharing 2+ named entity references with the top-K results. This adds intersection-cluster memories that the cosine pass missed but that are structurally connected through shared entities. With entity-hyperedge expansion and Claude Sonnet, multi-hop reaches 0.7917 (+0.250 over mem0, +0.145 over the prior best of 0.6771); we cannot isolate the independent contribution of retrieval versus reasoning without a model-held-out experiment (see Finding 8). The threshold of 2+ shared entities is load-bearing: single-entity neighbors add noise that regresses multi-hop; intersection-only neighbors add signal.
The wide-K finding and its implications
The +0.123 gain from expanding the candidate pool from 50 to 200 before reranking is the largest single-step improvement in our ablation, larger than the tree structure itself (+0.12). This is initially surprising but follows directly from the coverage arithmetic: at K=50 with ~1,300 memories per agent, the initial cosine pass samples 3.8% of the corpus. At K=200, coverage rises to 15.4%. For agents with longer memory histories, this gap will only grow.
The practical implication is that retrieval width is a first-class architectural concern, not a hyperparameter to tune late. Systems that design around K=50 for cost reasons are implicitly accepting a hard ceiling on recall. Cohere reranking is only as good as the candidates it receives; a cross-encoder cannot recover a fact that was never retrieved.
The corollary — that the reasoning LLM contributes +0.008 at K=50 and +0.023 at K=200 — challenges the common assumption that upgrading to a more capable model is the primary lever for improving memory-augmented agent performance. Our results suggest the opposite: for the majority of answerable questions, the bottleneck is whether the right memory is in the context window, not whether the model can reason over it once it arrives.
Open-domain: from regression to parity
At K=50, our system underperforms mem0 on open-domain questions (0.6813 vs 0.8109, −0.130). Open-domain questions in LOCOMO are broad aggregation queries ("What are [person]'s interests?", "How would you describe [person]'s personality?") that benefit from high recall across all memory types rather than high precision on a specific fact.
We diagnose this as a query-shape mismatch rather than an architectural failure. Our retrieval pipeline is optimized for precision: K=50 candidates, diversity filter, entity heatmap with fixed caps. This setup is effective for temporal and multi-hop questions, where the target memory is specific and the risk of irrelevant candidates is high. For open-domain aggregation queries, the same setup starves the model of breadth.
The fix follows directly from this diagnosis: adaptive K, a query-type classifier that scales retrieval width by query breadth, routes open-domain questions to a K×6 search path (up to 300 initial candidates) before reranking. With adaptive K and entity-hyperedge expansion (which adds entity-connected memories not in the cosine top-K), open-domain reaches 0.8216 — exceeding mem0's 0.8109 by +0.011. The improvement is concentrated in questions asking about a person's general characteristics and interests, where the wider candidate pool allows Cohere to select a more representative cross-section of memories.
This result reframes the open-domain finding: the original regression at K=50 was not evidence that the temporal tree architecture is fundamentally ill-suited to broad retrieval. It was evidence that retrieval width must be matched to query breadth. The architecture supports both modes; the adaptive classifier is the bridge.
6. Future Work
Four directions follow naturally from the findings and limitations identified in this paper.
Branch summary quality depends on tree quality. Fork-to-fork narrative summaries are implemented and live as a retrieval navigation aid, but the current experiment tree was built with a loose parent assignment threshold (0.60 cosine) that produced hub-heavy topology. Summaries generated on a cleaner tree (0.85 threshold) are expected to be more coherent and provide stronger branch navigation signal. Re-ingesting with the corrected threshold and re-running the summary generation job is the immediate next experiment.
Persistent typed memory edges. The current entity-hyperedge expansion derives connections on the fly via entity co-reference in memory_entities. A persistent edge table with LLM-written edge descriptions — "these two memories contradict each other about Caroline's timeline" — would enable richer traversal, pruning, and explanation. Typed edges (contradiction, refinement, temporal succession, causal) between fork nodes and summary nodes would form an explicit hyperedge layer above the temporal tree. Edge descriptions are themselves first-class memories in the model — facts about the relationship between two other facts — enabling a self-describing memory graph where the structure and its meaning are stored in the same representation.
Embedding fine-tuning. The query-statement semantic gap — questions are phrased differently from the facts that answer them — is currently bridged by wide retrieval (K=200) and cross-encoder reranking. A bi-encoder fine-tuned on (question, memory) pairs from LOCOMO or similar corpora would close this gap at the retrieval stage itself, reducing dependence on Cohere and enabling competitive performance at smaller K. The wide-K finding (+0.123 from K=50 to K=200) quantifies exactly how much room a better bi-encoder has to recover.
LLM-driven graph rewriting. The current deterministic maintenance pass (§3.3) handles structural hygiene without LLM calls. A second pass could apply LLM reasoning to consolidate a memory cluster into a single canonical sentence and generate bridging memories that connect nearby tree branches sharing implicit context. This is the long-horizon complement to the write-time tree structure: where the tree encodes relationships as they arrive, a rewriting pass repairs and enriches the structure as the corpus matures.
ONNX self-hosted reranking. Cohere's rerank-v3.5 is the current cross-encoder backend. Deploying a quantized cross-encoder (e.g., ms-marco-MiniLM-L6) locally via ONNX would remove the external API dependency for production deployments where latency and data residency matter. Our ablation shows the reranker identity contributes only +0.002 in isolation — the cost of switching backends is low.
7. Conclusion
We presented Temporal Tree Memory, a memory architecture for AI agents in which every stored fact is assigned a position in a tree at write time. The tree encodes time as topology (root to leaf), fact refinement as parent-child, and contradiction as forking branches. Retrieval becomes branch traversal: a semantic entry point, followed by structured expansion through tree descendants, entity co-reference neighbors, and fork-to-fork narrative summaries. A complementary entity heatmap ensures that memories connecting multiple named entities — the load-bearing signal for multi-hop reasoning — are always represented in context regardless of their cosine rank.
On LOCOMO, our system achieves an LLM judge score of 0.8227 (Claude Sonnet, wide retrieval) and 0.8000 (gpt-4.1-mini), compared to 0.6383 for mem0 on the same harness. Targeted category-level enhancements (adaptive K, entity-hyperedge expansion, recency-aware scoring) further improve multi-hop to 0.7917 and close the open-domain gap vs mem0, though overall headline score trades off slightly (0.8201 vs 0.8227; see §4.2). The gains are concentrated exactly where the architecture is designed to help: temporal reasoning (+0.486 over mem0) and multi-hop reasoning (+0.135 with wide retrieval, rising to +0.250 with the additional enhancements described in §4.2). A systematic ablation identifies four independent contributions and a clear hierarchy: memory structure first, retrieval coverage second, reranker identity near-zero, reasoning model marginal.
Three findings from the ablation are practically important beyond this paper. First, the choice of cross-encoder reranker contributes +0.002 in isolation — it does not drive performance. The dominant variable is whether the right memory is in the candidate pool, not how candidates are reranked once retrieved. Second, retrieval width is a first-class architectural concern: expanding from K=50 to K=200 before reranking adds +0.123, larger than the structural contribution itself. Systems that fix K for cost reasons accept a hard recall ceiling. Third, the LLM upgrade from gpt-4.1-mini to Claude Sonnet contributes +0.008 at K=50 and +0.023 at wide-K — noise-level at equivalent retrieval width. For most memory-augmented agent applications, the bottleneck is retrieval coverage, not reasoning capability. Investing in memory structure and retrieval width outperforms investing in a more capable model.
The open-domain regression that exists at K=50 closes with adaptive retrieval width: routing aggregation queries to a K×6 search path recovers and exceeds mem0's open-domain score. This reframes the tradeoff: precision-optimized retrieval is not inherently worse at broad queries, it simply requires a query-type classifier to match retrieval width to query breadth.
The evaluation harness, benchmark results, and Python SDK will be released publicly upon acceptance. We believe the core finding — that structure at write time, not post-hoc clustering, is what enables temporal and multi-hop reasoning — will generalize beyond this benchmark to any memory-augmented agent operating over long conversation histories.