Commands

run

Execute the full benchmark pipeline.
bun run src/index.ts run -p <provider> -b <benchmark> -j <judge> -r <run-id>
Option                   Description
-p, --provider           Memory provider (supermemory, mem0, zep)
-b, --benchmark          Benchmark (locomo, longmemeval, convomem)
-j, --judge              Judge model (default: gpt-4o)
-r, --run-id             Run identifier (auto-generated if omitted)
-m, --answering-model    Model for answer generation (default: gpt-4o)
-l, --limit              Limit number of questions
-s, --sample             Sample N questions per category
--sample-type            Sampling strategy: consecutive (default) or random
--force                  Clear checkpoint and restart
See Supported Models for all available judge and answering models.
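
For example, the following runs the locomo benchmark against supermemory with gpt-4o as the judge, limited to 10 questions (the run id is an illustrative value):
bun run src/index.ts run -p supermemory -b locomo -j gpt-4o -l 10 -r locomo-smoke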

compare

Run a benchmark across multiple providers in parallel.
bun run src/index.ts compare -p supermemory,mem0,zep -b locomo -j gpt-4o

test

Evaluate a single question for debugging.
bun run src/index.ts test -r <run-id> -q <question-id>

status

Check progress of a run.
bun run src/index.ts status -r <run-id>

show-failures

Debug failed questions with full context.
bun run src/index.ts show-failures -r <run-id>
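
A typical debugging loop pairs show-failures with test: list the failures, then re-evaluate one in isolation. The run id and question id below are illustrative values:
bun run src/index.ts show-failures -r locomo-smoke
bun run src/index.ts test -r locomo-smoke -q q_017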

list-questions

Browse benchmark questions.
bun run src/index.ts list-questions -b <benchmark>
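
For example, to browse the longmemeval questions (handy for picking a question id to pass to test):
bun run src/index.ts list-questions -b longmemeval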

Random Sampling

Sample N questions per category with optional randomization.
bun run src/index.ts run -p supermemory -b longmemeval -s 3 --sample-type random

serve

Start the web UI.
bun run src/index.ts serve
Opens at http://localhost:3000.

help

Get help on providers, models, or benchmarks.
bun run src/index.ts help providers
bun run src/index.ts help models
bun run src/index.ts help benchmarks

Checkpointing

Runs are saved to data/runs/{runId}/ and automatically resume from the last successful phase. Use --force to restart.
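
To resume an interrupted run, re-issue the same command with the same run id; check progress with status, or pass --force to discard the checkpoint and start over (the run id below is illustrative):
bun run src/index.ts run -p supermemory -b locomo -r locomo-smoke
bun run src/index.ts status -r locomo-smoke
bun run src/index.ts run -p supermemory -b locomo -r locomo-smoke --force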