Introduction
Large Language Models (LLMs) fundamentally suffer from "forgetting": they treat every interaction as a new, discrete event, lacking the persistent continuity required for personalized user experiences. While context windows are growing, long contexts bring high latency, and LLMs remain prone to losing information in the middle of the context window, as shown by Liu et al. [1].
In this report, we introduce Supermemory, a memory engine designed to solve the problem of long-term coherence. We demonstrate that Supermemory achieves State-of-the-Art (SOTA) results on LongMemEval_s [2], effectively solving the challenges of temporal reasoning and knowledge conflicts in high-noise environments (115k+ tokens).
Authors
AI Researcher, Supermemory
AI Researcher, Supermemory
CEO, Supermemory
The Evaluation Landscape: Why LongMemEval?
Current benchmarks in the LLM memory space often fail to capture the chaos of real-world production environments. Benchmarks like LoCoMo [3] are insufficient for modern models due to their limited context size and lack of knowledge-update tests (i.e., the ability to overwrite or update obsolete information with newer information).
We utilized LongMemEval [2] for validation because it represents the most rigorous approximation of real-world chat history. It challenges the retrieval system not just on recall, but on reasoning over time and filtering out noise. Unlike other benchmarks (which present human-human interaction), LongMemEval [2] tests human-assistant interactions, which is representative of real-world usage, as also highlighted by Rasmussen et al. [4].
LongMemEval_s [2] spans 500 questions split into 6 categories and evaluates five core memory capabilities:
Information Extraction: Accurately extracting and storing factual information from conversations. Categories:
single-session-user: Retrieving literal context mentioned by the user within a single session.
single-session-assistant: Retrieving literal context mentioned by the assistant within a single session.
single-session-preference: Extracting implicit user preferences to inform personalized responses.
Multi-Session Reasoning: Synthesizing information scattered across multiple conversation sessions. Categories: multi-session
Knowledge Update: Handling scenarios where newer information contradicts or supersedes older facts. Categories: knowledge-update
Temporal Reasoning: Understanding the sequence of events, calculating time intervals, and reasoning about relative timestamps. Categories: temporal-reasoning
Abstaining on Unanswerable Questions: Recognizing when sufficient information is not available and appropriately declining to answer. Categories: all
These capabilities cover a broad variety of general real-world use-cases.
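As an illustration, the six category labels above can be treated as a small union type when bucketing per-question results. The result shape and the tally helper below are our own illustrative assumptions, not part of the benchmark itself.

```typescript
// The six LongMemEval_s category labels, useful for bucketing judged answers.
type Category =
  | "single-session-user"
  | "single-session-assistant"
  | "single-session-preference"
  | "multi-session"
  | "knowledge-update"
  | "temporal-reasoning";

// Accuracy per category over judged answers (the `results` shape is assumed).
function accuracyByCategory(results: { category: Category; correct: boolean }[]) {
  const acc: Partial<Record<Category, { right: number; total: number }>> = {};
  for (const r of results) {
    const slot = acc[r.category] ?? { right: 0, total: 0 };
    slot.total += 1;
    if (r.correct) slot.right += 1;
    acc[r.category] = slot;
  }
  return acc;
}
```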
Methodology: Supermemory’s Architecture
Supermemory outperforms existing solutions by minimizing semantic ambiguity, a major reason context is utilized ineffectively in LLMs, as demonstrated by Keluskar et al. [5]. We achieve this by coupling memories with temporal metadata, relations, and raw chunks.
Chunk-based Ingestion & Contextual Memories
Standard RAG (Retrieval-Augmented Generation) [6] often fails because it retrieves raw chunks that lack context when isolated from the conversation [7].
Chunking: We decompose large sessions into manageable semantic blocks.
Memory Generation: As we index the chunk, we also generate memories — single (atomic) pieces of information that resolve ambiguous references within the chunk using a modified version of Contextual Retrieval [8].
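The two ingestion steps above can be sketched as follows. The fixed-size chunking heuristic, type names, and the `llm` stub are illustrative assumptions, not the actual Supermemory pipeline (real chunking would respect semantic boundaries, and an LLM call would generate the atomic memories).

```typescript
type Chunk = { id: string; text: string };
type Memory = { chunkId: string; fact: string };

// Step 1: split a session transcript into manageable blocks
// (a character-count stand-in for real semantic chunking).
function chunkSession(session: string, maxChars = 200): Chunk[] {
  const chunks: Chunk[] = [];
  for (let i = 0; i < session.length; i += maxChars) {
    chunks.push({ id: `chunk-${chunks.length}`, text: session.slice(i, i + maxChars) });
  }
  return chunks;
}

// Step 2: generate atomic memories per chunk; `llm` stands in for the
// Contextual Retrieval-style model call that resolves ambiguous references.
function generateMemories(chunk: Chunk, llm: (text: string) => string[]): Memory[] {
  return llm(chunk.text).map((fact) => ({ chunkId: chunk.id, fact }));
}
```

Each memory keeps a back-pointer to its source chunk, which the hybrid search step later uses to re-inject raw detail.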
Relational Versioning & Knowledge Chains
Supermemory also defines semantic relationships between new and existing memories. This allows us to map evolution of facts:
updates (State Mutation): Handles contradictions or corrections (e.g., "My favorite color is now Green" supersedes "My favorite color is Blue"), creating a version history.
extends (Refinement): Supplements existing nodes with new details without contradiction (e.g., adding a job title to an existing employment memory).
derives (Inference): Captures second-order facts inferred by combining multiple distinct memories.
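A minimal sketch of how such relation edges might be modeled, and how an "updates" chain resolves to the newest fact. The node shape and the `resolveCurrent` helper are hypothetical illustrations, not Supermemory's internal schema.

```typescript
type Relation = "updates" | "extends" | "derives";

interface MemoryNode {
  id: string;
  fact: string;
  relatesTo?: { target: string; relation: Relation };
}

// "updates" supersedes an older fact, forming a version chain.
const v1: MemoryNode = { id: "m1", fact: "Favorite color is Blue" };
const v2: MemoryNode = {
  id: "m2",
  fact: "Favorite color is Green",
  relatesTo: { target: "m1", relation: "updates" },
};

// Resolving the current value means following "updates" edges to the newest node.
function resolveCurrent(nodes: MemoryNode[], id: string): MemoryNode {
  const newer = nodes.find(
    (n) => n.relatesTo?.relation === "updates" && n.relatesTo.target === id
  );
  return newer ? resolveCurrent(nodes, newer.id) : nodes.find((n) => n.id === id)!;
}
```

Keeping superseded nodes (rather than deleting them) is what enables the knowledge-update and version-history behavior described above.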
Temporal Grounding
A core differentiator in our architecture is a dual-layer time-stamping approach, which drove our high scores in the temporal-reasoning, knowledge-update, and multi-session categories. For every memory, we extract:
documentDate: The time the conversation took place.
eventDate: The extracted timestamp of when the event described in the conversation actually occurred.
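The dual timestamps might be modeled as follows; the memory shape and the relative-date helper are illustrative assumptions. The key point, echoed in the answering prompt, is that relative terms like "yesterday" resolve against documentDate, not the current date.

```typescript
interface TimedMemory {
  fact: string;
  documentDate: string; // when the conversation took place (ISO date)
  eventDate: string[];  // when the described event(s) actually occurred (ISO dates)
}

// Resolve a relative day ("yesterday" = -1) against the conversation date,
// not against the wall clock at retrieval time.
function resolveRelativeDay(documentDate: string, offsetDays: number): string {
  const d = new Date(documentDate);
  d.setUTCDate(d.getUTCDate() + offsetDays);
  return d.toISOString().slice(0, 10);
}

const memory: TimedMemory = {
  fact: "User ran a marathon yesterday",
  documentDate: "2023-05-30",
  eventDate: [resolveRelativeDay("2023-05-30", -1)], // resolves to 2023-05-29
};
```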
Hybrid Search Strategy
We perform semantic search on the memories to identify relevant concepts. Because memories encapsulate singular pieces of information (high signal, low noise), this is generally more accurate than searching the noisy chunks directly, as noted by several sources [9][10].
Once a hit is found, we inject the original source chunk for the memory into the result output. This allows the LLM to access the "finer details" required for nuance while relying on the atomicity of the memory for high-precision retrieval. This serves to resolve the concern of information loss brought up in Section 5.2 of the LongMemEval [2] paper.
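The memory-first search with chunk injection can be sketched as below. An in-memory array with cosine scoring stands in for a real embedding model and vector index; the types and function names are our own assumptions.

```typescript
// Each indexed entry pairs the atomic memory (searched) with its raw chunk (returned).
interface Indexed { memory: string; chunk: string; vector: number[] }

function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  return dot / (Math.hypot(...a) * Math.hypot(...b));
}

// Rank by similarity to the memory vectors, then inject each hit's source chunk
// so the LLM sees both the high-precision fact and the finer raw detail.
function search(query: number[], index: Indexed[], k = 3) {
  return [...index]
    .sort((x, y) => cosine(query, y.vector) - cosine(query, x.vector))
    .slice(0, k)
    .map((hit) => ({ memory: hit.memory, sourceChunk: hit.chunk }));
}
```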
Session-Based Ingestion
Unlike the evaluation methodology described in the LongMemEval paper [2], which processes conversation history round by round (one user message followed by one assistant response), we ingest the dataset session by session.
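The difference is one of batching granularity: session-by-session ingestion hands the memory generator a whole session's context in a single call, rather than one user/assistant pair at a time. A hedged sketch, with assumed types:

```typescript
type Turn = { role: "user" | "assistant"; text: string };
type Session = { sessionId: string; turns: Turn[] };

// One ingestion call per session; the full transcript is flattened so that
// memory generation sees every turn's surrounding context at once.
function ingestSessionwise(sessions: Session[], ingest: (doc: string) => void): number {
  for (const s of sessions) {
    ingest(s.turns.map((t) => `${t.role}: ${t.text}`).join("\n"));
  }
  return sessions.length; // number of ingestion calls made
}
```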
Performance Results
Supermemory demonstrates superior performance across all categories on LongMemEval_s. The system shows particular strength in Multi Session (71.43%) and Temporal Reasoning (76.69%), areas where standard vector-store approaches historically struggle.
LLM-as-Judge Evaluation
How to reproduce these results
We believe in transparency and rigorous cross-validation.
Data & Prompts: The full prompt used for answering is provided in the appendix. For answer evaluation, we used gpt-4o with the question-specific prompts provided in the LongMemEval paper [2].
Codebase: The ingestion pipeline, search logic, and evaluation scripts are available in our GitHub repository.
Conclusion
The ability to accurately recall user details, respect temporal sequences, and update knowledge over time is not a "feature"; it is a prerequisite for Agentic AI.
By moving beyond simple vector similarity and implementing this form of disambiguation, Supermemory provides a robust backend for enterprise applications. It transforms the LLM from a stateless processor into a stateful assistant, capable of maintaining long-term, personalized user narratives with high fidelity.
Citations
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12, 157-173.
Wu, D., Wang, H., Yu, W., Zhang, Y., Chang, K. W., & Yu, D. (2024). Longmemeval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813.
Maharana, A., Lee, D. H., Tulyakov, S., Bansal, M., Barbieri, F., & Fang, Y. (2024). Evaluating very long-term conversational memory of llm agents. arXiv preprint arXiv:2402.17753.
Rasmussen, P., Paliychuk, P., Beauvais, T., Ryan, J., & Chalef, D. (2025). Zep: a temporal knowledge graph architecture for agent memory. arXiv preprint arXiv:2501.13956.
Keluskar, A., Bhattacharjee, A., & Liu, H. (2024, December). Do llms understand ambiguity in text? a case study in open-world question answering. In 2024 IEEE International Conference on Big Data (BigData) (pp. 7485-7490). IEEE.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., ... & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33, 9459-9474.
Barnett, S., Kurniawan, S., Thudumu, S., Brannelly, Z., & Abdelrazek, M. (2024, April). Seven failure points when engineering a retrieval augmented generation system. In Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering-Software Engineering for AI (pp. 194-199).
Ford, D. (2024, September). Introducing Contextual Retrieval. In Anthropic Engineering Blog. [Online].
https://www.anthropic.com/engineering/contextual-retrieval ↗
Doval, Y., Vilares, J., & Gómez-Rodríguez, C. (2020). Towards robust word embeddings for noisy texts. Applied Sciences, 10(19), 6893.
Shah, P. (2024, August). The Effects of Data Noise on the Efficiency of Vector Search Algorithms. In LinkedIn Pulse. [Online].
https://www.linkedin.com/pulse/effects-data-noise-efficiency-vector-search-algorithms-pankil-shah-4pwfe/ ↗
Appendix
Answering Prompt
`You are a question-answering system. Based on the retrieved context below, answer the question.
Question: ${question}
Question Date: ${questionDate}
Retrieved Context:
${retrievedContext}
Understanding the Context:
The context contains search results from a memory system. Each result has multiple components you can use:
Memory: A high-level summary/atomic fact (e.g., "Alex loves hiking in mountains", "John reports to Maria")
This is the searchable title/summary of what was stored
Chunks: The actual detailed raw content where the memory was extracted from
Contains conversations, documents, messages, or text excerpts
This is your primary source for detailed information and facts
Look here for specifics, context, quotes, and evidence
Temporal Context (if present):
Question Date: The date when the question was asked (provided above). Use this to understand the temporal perspective of the question.
documentDate: ISO date string for when the content was originally authored/written/said by the user (NOT the system createdAt timestamp). This is the reference point for calculating relative dates. Extract from document metadata, timestamps, or context.
eventDate: Array of ISO date strings for when the event/fact being referenced actually occurred or will occur. Always provided as an array, even for single dates. For past events use past dates, for future events use future dates. Calculate relative dates (today, yesterday, last week) based on documentDate, NOT the current date.
Useful for time-based questions (what happened when, recent vs old info)
Important: When you see relative terms like "today", "yesterday", calculate them relative to the documentDate, NOT the current date. The question date helps you understand the temporal context of what the user is asking about.
Profile Data (if present):
Static Profile: Permanent user characteristics (name, preferences, core identity)
Dynamic Profile: Contains a subset of the recently added memories
Provides background about the user
Version: Shows if a memory has been updated/extended over time
How to Answer:
Start by scanning memory titles to find relevant results
Read the chunks carefully - they contain the actual details you need
Use temporal context to understand when things happened
Use profile data for background about the user
Synthesize information from multiple results if needed
Instructions:
If the context contains enough information to answer the question, provide a clear, concise answer
If the context does not contain enough information, respond with "I don't know" or explain what information is missing
Base your answer ONLY on the provided context
Prioritize information from chunks - they're the raw source material
Answer:`
Results for Zep were taken from their paper [4].