System Overview
Core Components
| Component | Role |
|---|---|
| Benchmarks | Load test data and provide questions with ground truth answers |
| Providers | Memory services being evaluated (handle ingestion and search) |
| Judges | LLM-based evaluators that score answers against ground truth |
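
The three roles above can be sketched as small interfaces. This is an illustrative sketch only: the names `Question`, `Benchmark`, `Provider`, `Judge`, and every method signature are assumptions, not the project's actual API. The toy `InMemoryProvider` and `ExactMatchJudge` stand in for a real memory service and an LLM-based judge.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Question:
    """A benchmark question paired with its ground truth answer."""
    question_id: str
    text: str
    ground_truth: str


class Benchmark(Protocol):
    """Loads test data and provides questions (hypothetical interface)."""
    def load_sessions(self) -> list[str]: ...
    def questions(self) -> list[Question]: ...


class Provider(Protocol):
    """Memory service under evaluation (hypothetical interface)."""
    def ingest(self, sessions: list[str]) -> None: ...
    def search(self, query: str) -> list[str]: ...


class Judge(Protocol):
    """Scores an answer against the ground truth (hypothetical interface)."""
    def score(self, answer: str, ground_truth: str) -> float: ...


class InMemoryProvider:
    """Toy provider: stores sessions in a list, matches on shared words."""

    def __init__(self) -> None:
        self._sessions: list[str] = []

    def ingest(self, sessions: list[str]) -> None:
        self._sessions.extend(sessions)

    def search(self, query: str) -> list[str]:
        words = set(query.lower().split())
        return [s for s in self._sessions if words & set(s.lower().split())]


class ExactMatchJudge:
    """Stand-in for an LLM judge: 1.0 if the ground truth appears in the answer."""

    def score(self, answer: str, ground_truth: str) -> float:
        return 1.0 if ground_truth.lower() in answer.lower() else 0.0
```

Using structural `Protocol` types means any provider or judge with matching methods can be plugged in without inheriting from a base class.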
Pipeline
| Phase | What Happens |
|---|---|
| Ingest | Load benchmark sessions → Push to provider |
| Index | Wait for provider indexing |
| Search | Query provider → Retrieve context |
| Answer | Build prompt → Generate answer via LLM |
| Evaluate | Compare to ground truth → Score via judge |
| Report | Aggregate scores → Output accuracy + latency |
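
The phases above can be sketched as a single function that runs them in order. This is a minimal sketch under stated assumptions: `run_pipeline`, `ToyProvider`, `wait_until_indexed`, `answer_fn`, and `judge_fn` are hypothetical names, and the reported latency is assumed to be search latency, which the table does not specify.

```python
import time
from statistics import mean


class ToyProvider:
    """Hypothetical in-memory provider used only to exercise the sketch."""

    def __init__(self):
        self.sessions = []

    def ingest(self, sessions):
        self.sessions.extend(sessions)

    def wait_until_indexed(self):
        # A real provider would poll its indexing status here.
        return None

    def search(self, query):
        words = set(query.lower().split())
        return [s for s in self.sessions if words & set(s.lower().split())]


def run_pipeline(sessions, questions, provider, answer_fn, judge_fn):
    """Run the six phases in order; questions is a list of (question, truth) pairs."""
    # Ingest: push benchmark sessions to the provider.
    provider.ingest(sessions)
    # Index: wait for the provider to finish indexing.
    provider.wait_until_indexed()

    scores, latencies = [], []
    for question, truth in questions:
        # Search: query the provider and time the retrieval (assumed metric).
        start = time.perf_counter()
        context = provider.search(question)
        latencies.append(time.perf_counter() - start)
        # Answer: build a prompt from the context and generate an answer.
        answer = answer_fn(question, context)
        # Evaluate: score the answer against the ground truth.
        scores.append(judge_fn(answer, truth))

    # Report: aggregate scores and latencies.
    return {
        "accuracy": mean(scores) if scores else 0.0,
        "mean_search_latency_s": mean(latencies) if latencies else 0.0,
        "questions": len(scores),
    }
```

In a real run, `answer_fn` and `judge_fn` would each call an LLM; here they are plain callables so the control flow can be exercised without network access.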
Advanced Checkpointing
Runs persist to data/runs/{runId}/.
Use --force to restart.
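
One way such checkpointing could work is sketched below. The sketch assumes that rerunning with the same run ID skips phases already completed and that forcing a restart discards the saved state; neither behavior is confirmed by the text above. The `checkpoint.json` file name, the `run_with_checkpoints` function, and the `phase_fns` mapping are all hypothetical.

```python
import json
import shutil
from pathlib import Path

PHASES = ["ingest", "index", "search", "answer", "evaluate", "report"]


def run_with_checkpoints(run_id, phase_fns, base_dir="data/runs", force=False):
    """Run each phase once per run ID, recording progress on disk.

    Returns the list of phases actually executed in this call.
    """
    run_dir = Path(base_dir) / run_id
    if force and run_dir.exists():
        # Assumed meaning of --force: discard saved state and start over.
        shutil.rmtree(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical checkpoint file listing the completed phases.
    checkpoint = run_dir / "checkpoint.json"
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else []

    executed = []
    for phase in PHASES:
        if phase in done:
            continue  # already completed in an earlier invocation
        phase_fns[phase]()
        done.append(phase)
        # Persist after every phase so an interrupted run can pick up here.
        checkpoint.write_text(json.dumps(done))
        executed.append(phase)
    return executed
```

Writing the checkpoint after each phase, rather than once at the end, is what lets an interrupted run continue from the last completed phase.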