## Commands

### run
Execute the full benchmark pipeline.

| Option | Description |
|---|---|
| `-p, --provider` | Memory provider (`supermemory`, `mem0`, `zep`) |
| `-b, --benchmark` | Benchmark (`locomo`, `longmemeval`, `convomem`) |
| `-j, --judge` | Judge model (default: `gpt-4o`) |
| `-r, --run-id` | Run identifier (auto-generated if omitted) |
| `-m, --answering-model` | Model for answer generation (default: `gpt-4o`) |
| `-l, --limit` | Limit number of questions |
| `-s, --sample` | Sample N questions per category |
| `--sample-type` | Sampling strategy: `consecutive` (default) or `random` |
| `--force` | Clear checkpoint and restart |
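Combining these options, a typical invocation might look like the following. This is a sketch only: the binary name `memory-bench` is a placeholder (substitute your actual entry point); the flags themselves are taken from the table above.

```shell
# Hypothetical invocation; `memory-bench` is a placeholder binary name.
# Run the longmemeval benchmark against mem0, sampling 10 random
# questions per category, with gpt-4o as both answering model and judge.
memory-bench run \
  --provider mem0 \
  --benchmark longmemeval \
  --judge gpt-4o \
  --answering-model gpt-4o \
  --sample 10 \
  --sample-type random
```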
### compare
Run benchmark across multiple providers in parallel.

### test
Evaluate a single question for debugging.

### status
Check progress of a run.

### show-failures
Debug failed questions with full context.

### list-questions
Browse benchmark questions.

### Random Sampling
Sample N questions per category with optional randomization.

### serve
Start the web UI.

### help
Get help on providers, models, or benchmarks.

### Checkpointing
Runs are saved to `data/runs/{runId}/` and automatically resume from the last successful phase. Use `--force` to restart.
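The checkpoint flow can be sketched as follows, again using the placeholder binary name `memory-bench` and a hypothetical run id:

```shell
# Hypothetical invocation; `memory-bench` and `my-run` are placeholders.
# First run writes phase checkpoints under data/runs/my-run/.
memory-bench run -p supermemory -b locomo -r my-run

# Re-running with the same run id resumes from the last successful phase.
memory-bench run -p supermemory -b locomo -r my-run

# --force clears the checkpoint and starts the run over from scratch.
memory-bench run -p supermemory -b locomo -r my-run --force
```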