System Overview

Core Components

| Component | Role |
|---|---|
| Benchmarks | Load test data and provide questions with ground-truth answers |
| Providers | Memory services being evaluated (handle ingestion and search) |
| Judges | LLM-based evaluators that score answers against ground truth |
See Integrations for all supported benchmarks, providers, and models.
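As a rough sketch of how the three components fit together, the interfaces below are illustrative shapes (the names `Benchmark`, `Provider`, `Judge`, and `NaiveProvider` are assumptions for this example, not the project's actual type definitions), with a toy in-memory provider showing the ingest/search contract:

```typescript
// Hypothetical component shapes; field and method names are illustrative.
interface Question { question: string; groundTruth: string; }

interface Benchmark {
  loadSessions(): string[];          // raw session transcripts to ingest
  questions(): Question[];           // questions paired with ground truth
}

interface Provider {
  ingest(sessionText: string): void; // push a session into the memory service
  search(query: string): string[];   // retrieve relevant context for a query
}

interface Judge {
  // Score an answer against ground truth, e.g. 1 = correct, 0 = incorrect.
  score(answer: string, groundTruth: string): number;
}

// Minimal in-memory provider: keyword overlap stands in for real retrieval.
class NaiveProvider implements Provider {
  private sessions: string[] = [];
  ingest(sessionText: string): void {
    this.sessions.push(sessionText);
  }
  search(query: string): string[] {
    const terms = query.toLowerCase().split(/\s+/);
    return this.sessions.filter(s =>
      terms.some(t => s.toLowerCase().includes(t)));
  }
}

const provider = new NaiveProvider();
provider.ingest("Alice said she adopted a cat named Miso.");
provider.ingest("Bob talked about his marathon training.");
const hits = provider.search("What pet does Alice have?");
```

A real provider would call out to the memory service's API for both `ingest` and `search`; the point here is only the division of labor between the three roles.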

Pipeline

| Phase | What Happens |
|---|---|
| Ingest | Load benchmark sessions → Push to provider |
| Index | Wait for provider indexing |
| Search | Query provider → Retrieve context |
| Answer | Build prompt → Generate answer via LLM |
| Evaluate | Compare to ground truth → Score via judge |
| Report | Aggregate scores → Output accuracy + latency |
Each phase checkpoints independently, so failed runs resume from the last successful point.
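The resume behavior can be sketched as a loop that skips phases already recorded in the checkpoint (the `Checkpoint` shape and `resumeRun` helper are assumptions for illustration, not the orchestrator's actual code):

```typescript
// Per-phase checkpointing sketch: a checkpoint records completed phases,
// and resuming re-runs only what is left. Names are illustrative.
type Phase = "ingest" | "index" | "search" | "answer" | "evaluate" | "report";

const PHASES: Phase[] = ["ingest", "index", "search", "answer", "evaluate", "report"];

interface Checkpoint { completed: Phase[]; }

const executed: Phase[] = [];

function runPhase(phase: Phase): void {
  executed.push(phase); // a real phase would do I/O and LLM calls here
}

function resumeRun(checkpoint: Checkpoint): Checkpoint {
  for (const phase of PHASES) {
    if (checkpoint.completed.includes(phase)) continue; // skip finished work
    runPhase(phase);
    checkpoint.completed.push(phase); // persisted to disk after each phase
  }
  return checkpoint;
}

// Simulate a run that failed after "search": only the last three phases run.
const resumed = resumeRun({ completed: ["ingest", "index", "search"] });
```

In practice the checkpoint would be written to disk after every phase, so a crash anywhere loses at most the in-flight phase.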

Advanced Checkpointing

Runs persist to `data/runs/{runId}/`:

```
data/runs/my-run/
├── checkpoint.json    # Run state and progress
├── results/           # Search results per question
└── report.json        # Final report
```

Re-running the same run ID resumes from the checkpoint. Use `--force` to restart from scratch.
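For a sense of what resuming reads back, here is a hypothetical shape for `checkpoint.json` (the field names are assumptions inferred from the layout above, not the tool's actual schema):

```typescript
// Hypothetical checkpoint.json schema; fields are illustrative.
interface RunCheckpoint {
  runId: string;     // matches the directory name under data/runs/
  phase: string;     // last phase that completed successfully
  answered: number;  // progress counter within the current phase
  total: number;     // total questions in the benchmark
}

const checkpoint: RunCheckpoint = {
  runId: "my-run",
  phase: "search",
  answered: 120,
  total: 500,
};

// Resuming amounts to reading the file back and picking up where it stopped.
const serialized = JSON.stringify(checkpoint, null, 2);
const restored: RunCheckpoint = JSON.parse(serialized);
```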

File Structure

```
src/
├── cli/commands/             # run, compare, test, serve, status...
├── orchestrator/phases/      # ingest, search, answer, evaluate, report
├── benchmarks/
│   └── <name>/index.ts       # e.g. locomo/, longmemeval/, convomem/
├── providers/
│   └── <name>/
│       ├── index.ts          # Provider implementation
│       └── prompts.ts        # Custom prompts (optional)
├── judges/                   # openai.ts, anthropic.ts, google.ts
└── types/                    # provider.ts, benchmark.ts, unified.ts
```