Commands

run

Execute the full benchmark pipeline.
bun run src/index.ts run -p <provider> -b <benchmark> -j <judge> -r <run-id>
Option                   Description
-p, --provider           Memory provider (supermemory, mem0, zep)
-b, --benchmark          Benchmark (locomo, longmemeval, convomem)
-j, --judge              Judge model (default: gpt-4o)
-r, --run-id             Run identifier (auto-generated if omitted)
-m, --answering-model    Model for answer generation (default: gpt-4o)
-l, --limit              Limit number of questions
-s, --sample             Sample N questions per category
--sample-type            Sampling strategy: consecutive (default) or random
--force                  Clear checkpoint and restart
See Supported Models for all available judge and answering models.
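
For example, the following runs the locomo benchmark against supermemory with gpt-4o as the judge, limited to 10 questions (the run id is an illustrative value):
bun run src/index.ts run -p supermemory -b locomo -j gpt-4o -l 10 -r locomo-smoke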

compare

Run a benchmark across multiple providers in parallel.
bun run src/index.ts compare -p supermemory,mem0,zep -b locomo -j gpt-4o

test

Evaluate a single question for debugging.
bun run src/index.ts test -r <run-id> -q <question-id>

status

Check progress of a run.
bun run src/index.ts status -r <run-id>

show-failures

Debug failed questions with full context.
bun run src/index.ts show-failures -r <run-id>
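
A typical debugging loop pairs show-failures with test: list the failures, then re-evaluate one in isolation. The run id and question id below are illustrative values:
bun run src/index.ts show-failures -r locomo-smoke
bun run src/index.ts test -r locomo-smoke -q q_017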

list-questions

Browse benchmark questions.
bun run src/index.ts list-questions -b <benchmark>
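
For example, to browse the longmemeval questions (handy for picking a question id to pass to test):
bun run src/index.ts list-questions -b longmemeval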

Random Sampling

Sample N questions per category with optional randomization.
bun run src/index.ts run -p supermemory -b longmemeval -s 3 --sample-type random

serve

Start the web UI.
bun run src/index.ts serve
Opens at http://localhost:3000.

help

Get help on providers, models, or benchmarks.
bun run src/index.ts help providers
bun run src/index.ts help models
bun run src/index.ts help benchmarks

Checkpointing

Runs are saved to data/runs/{runId}/ and automatically resume from the last successful phase. Use --force to restart.
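
To resume an interrupted run, re-issue the same command with the same run id; check progress with status, or pass --force to discard the checkpoint and start over (the run id below is illustrative):
bun run src/index.ts run -p supermemory -b locomo -r locomo-smoke
bun run src/index.ts status -r locomo-smoke
bun run src/index.ts run -p supermemory -b locomo -r locomo-smoke --force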