Self-Hosting Configuration

The self-hosted server aims for zero configuration. The only required input is one LLM provider key, which the first-boot wizard collects interactively; for non-interactive deployments, set it through an environment variable instead. Embeddings default to a local English-language model, and you can pick another provider in the optional wizard step or via environment variables. Everything else below is opt-in. The installer writes API keys to ~/.supermemory/env, which is loaded on every launch. You can also set variables in your shell or a process manager.
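For a non-interactive deployment, one way to supply the required key is to export it in the environment that launches the server. A minimal sketch; the key value is a placeholder, and `OPENAI_API_KEY` stands in for whichever provider variable you use:

```shell
# Provide the one required input (an LLM provider key) through the environment.
# The value below is a placeholder, not a real key.
export OPENAI_API_KEY="sk-..."

# Then start the server from this same shell (or hand the variable to your
# process manager) so it inherits the key.
```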

Core

Variable	Purpose	Default
`PORT` (or `SUPERMEMORY_PORT`)	HTTP listen port	`6767`
`SUPERMEMORY_DATA_DIR`	Where the graph engine’s data, auth secret, and model cache live	`./.supermemory`
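As an illustration, the two core settings might look like this for a server that listens on a non-default port and keeps its data outside the launch directory. The port and path are arbitrary examples. The default data directory is a relative path, so this sketch assumes it resolves against the directory the server is started from:

```shell
# Listen on 8080 instead of the default 6767 (example value)
PORT=8080

# Absolute path, so the data location does not depend on the launch directory
# (example path; any writable directory should do)
SUPERMEMORY_DATA_DIR=/var/lib/supermemory
```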

LLM providers

In production, Supermemory uses its own proprietary models tuned for long-horizon data understanding. Self-hosted, you bring your own LLM for the intelligent steps — summaries, contextual chunking, and memory extraction. Embeddings default to a local model (no API key) and can optionally use OpenAI, Gemini, or Ollama — see Embeddings. Configure at least one LLM provider:

Variable	Provider
`OPENAI_API_KEY`	OpenAI — or any OpenAI-compatible endpoint, see below
`ANTHROPIC_API_KEY`	Anthropic
`GEMINI_API_KEY`	Google AI Studio (Gemini)
`GROQ_API_KEY`	Groq
`WORKERS_AI_API_KEY` + `CLOUDFLARE_ACCOUNT_ID`	Cloudflare Workers AI
`GOOGLE_VERTEX_PROJECT_ID` + `GOOGLE_VERTEX_LOCATION`	GCP Vertex AI

No key set? The server walks you through it. On first boot, an interactive setup wizard asks which provider you want, securely prompts for the key, and saves it encrypted — including a custom base URL and model name if you pick an OpenAI-compatible endpoint.

With multiple providers configured, the first one in the order above is used.
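The precedence rule can be pictured as a chain of checks in table order. The function below is an illustrative sketch, not the server's code: `pick_provider` and the labels it prints are hypothetical, and it assumes the two-variable providers need both variables set.

```shell
# Illustrative only: mirrors the documented order of the provider table.
# Prints a hypothetical label for the provider that would be selected.
pick_provider() {
  if [ -n "${OPENAI_API_KEY:-}" ]; then
    echo "openai"
  elif [ -n "${ANTHROPIC_API_KEY:-}" ]; then
    echo "anthropic"
  elif [ -n "${GEMINI_API_KEY:-}" ]; then
    echo "gemini"
  elif [ -n "${GROQ_API_KEY:-}" ]; then
    echo "groq"
  elif [ -n "${WORKERS_AI_API_KEY:-}" ] && [ -n "${CLOUDFLARE_ACCOUNT_ID:-}" ]; then
    # Assumption: both variables must be present for this provider
    echo "workers-ai"
  elif [ -n "${GOOGLE_VERTEX_PROJECT_ID:-}" ] && [ -n "${GOOGLE_VERTEX_LOCATION:-}" ]; then
    # Assumption: both variables must be present for this provider
    echo "vertex"
  else
    echo "none"
  fi
}
```

Running `pick_provider` in the shell you launch the server from shows which key would win under this reading. That helps when a stray `OPENAI_API_KEY` in the environment shadows the provider you meant to use.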

Image, video, and high-fidelity PDF understanding require a Gemini or Vertex AI key. Text ingestion, memory extraction, and search work with any provider.

Fully offline with local models

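Setting OPENAI_API_KEY together with OPENAI_BASE_URL covers any OpenAI-compatible endpoint. Local runners such as Ollama, LM Studio, vLLM, and llama.cpp server can keep everything on your own hardware; hosted services such as Together and Fireworks work through the same two variables but are not offline.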

# Ollama example — gpt-oss-20b works great
OPENAI_BASE_URL=http://localhost:11434/v1
OPENAI_API_KEY=ollama        # any non-empty string for local runners
OPENAI_MODEL=gpt-oss:20b

Variable	Purpose	Default
`OPENAI_BASE_URL`	OpenAI-compatible endpoint URL	OpenAI
`OPENAI_MODEL`	Model ID sent to that endpoint	`gpt-5.1`
`OPENAI_FAST_MODEL`	Override for fast/light tasks	value of `OPENAI_MODEL`
`OPENAI_TEXT_MODEL`	Override for heavier text tasks	value of `OPENAI_MODEL`
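The two override variables let light and heavy tasks use different models on the same endpoint. A sketch for a local Ollama setup, assuming both model tags have already been pulled; `gpt-oss:120b` for the heavier tasks is only an example choice:

```shell
OPENAI_BASE_URL=http://localhost:11434/v1
OPENAI_API_KEY=ollama              # any non-empty string for local runners
OPENAI_MODEL=gpt-oss:20b           # used wherever no override applies
OPENAI_FAST_MODEL=gpt-oss:20b      # fast/light tasks (same as the default here)
OPENAI_TEXT_MODEL=gpt-oss:120b     # heavier text tasks (example tag)
```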

File storage

Nothing to configure. Uploaded files (PDFs, images) are stored on local disk inside $SUPERMEMORY_DATA_DIR and served by the server at /files/:key.

Embeddings

By default, vectors are computed locally with Xenova/bge-base-en-v1.5 (768d), so no embedding API key is needed. On interactive first boot you can pick a different provider after the LLM key step; for Docker or CI, set env vars instead. For the full provider table, multilingual guidance, remote examples (OpenAI / Gemini / Ollama), and the re-ingestion / dimension-lock warning, see Embeddings (self-hosted).

Variable	Purpose	Default
`SUPERMEMORY_EMBEDDING_PROVIDER`	`local`, `openai`, `gemini`, or OpenAI-compatible remote	`local`
`SUPERMEMORY_EMBEDDING_MODEL`	Model id for the chosen provider	`Xenova/bge-base-en-v1.5`
`SUPERMEMORY_EMBEDDING_DIMENSIONS`	Vector size; must match model and stored data	`768`
`SUPERMEMORY_EMBEDDING_BASE_URL`	Base URL for OpenAI-compatible embedding APIs	unset

Embedding performance

Local embeddings are prewarmed at startup with conservative defaults — one worker, minimal CPU footprint. Turn these up if you’re ingesting heavily and prefer throughput over headroom:

Variable	Purpose	Default
`SUPERMEMORY_LOCAL_EMBEDDING_POOL_SIZE`	Number of embedding workers	`1`
`SUPERMEMORY_LOCAL_EMBEDDING_WASM_THREADS`	Compute threads per worker	`1`
`SUPERMEMORY_LOCAL_EMBEDDING_BATCH_SIZE`	Texts per worker dispatch	`8`
`SUPERMEMORY_LOCAL_EMBEDDING_IDLE_TIMEOUT_MS`	Idle time before workers shut down	`120000`
`SUPERMEMORY_SKIP_EMBEDDING_PREWARM`	Skip startup prewarm, load on first use	unset
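Since each worker runs its own compute threads, the total thread count is roughly pool size times threads per worker. The helper below is a rule-of-thumb sketch, not an official formula: `suggest_pool_size` is a hypothetical name, and the heuristic of keeping the product at or below the cores you are willing to dedicate is an assumption.

```shell
# Heuristic (an assumption): pool_size * wasm_threads <= cores to dedicate.
# Arguments: cores_to_dedicate threads_per_worker
suggest_pool_size() {
  local cores=$1 threads=$2
  local pool=$(( cores / threads ))
  # Never suggest fewer than the default of one worker
  if [ "$pool" -lt 1 ]; then
    pool=1
  fi
  echo "$pool"
}

# Example: dedicate 4 cores with 2 threads per worker
echo "SUPERMEMORY_LOCAL_EMBEDDING_WASM_THREADS=2"
echo "SUPERMEMORY_LOCAL_EMBEDDING_POOL_SIZE=$(suggest_pool_size 4 2)"
```

Treat the result as a starting point and watch CPU and memory during a real import before raising it further.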

Memory limits & ingestion queue

The server manages memory for you and separates the two kinds of work you send it:

Searches are always served immediately. They never wait behind ingestion, regardless of how much is queued.
Adds are accepted instantly but processed through a queue. A POST /v3/documents call returns in milliseconds with status queued; extraction, embedding, and indexing happen in the background at a controlled pace.
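To make the add path concrete, here is a hedged sketch of queueing one document with curl. Only the `POST /v3/documents` path and the default port come from this page. The request body shape, the bearer-token header, the `SUPERMEMORY_API_KEY` variable name, and the precedence between the two port variables are all assumptions for illustration, so check them against your server before relying on them.

```shell
# Base URL of the local server. The precedence between the two port
# variables is an assumption; 6767 is the documented default.
supermemory_base_url() {
  echo "http://localhost:${SUPERMEMORY_PORT:-${PORT:-6767}}"
}

# Queue one document. The caller passes the raw JSON body as $1.
# ASSUMPTIONS: the bearer-token header and SUPERMEMORY_API_KEY are
# hypothetical stand-ins for whatever credential your server expects.
add_document() {
  curl -sS -X POST "$(supermemory_base_url)/v3/documents" \
    -H "Authorization: Bearer ${SUPERMEMORY_API_KEY:-}" \
    -H "Content-Type: application/json" \
    -d "$1"
}
```

A call might look like `add_document '{"content": "Meeting notes from Tuesday"}'`, where the `content` field is again an assumed body shape. Per the description above, the response should come back quickly with a queued status while processing continues in the background.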

Ingestion may grow the server’s memory usage by at most SUPERMEMORY_EMBEDDING_RAM_LIMIT (default 1 GB) above its post-boot baseline. Past that, new documents simply wait in the queue until memory drops back under the limit — nothing is dropped, ingestion just slows down. The limit is measured above the boot baseline because the built-in local embeddings and storage engine have a fixed footprint that exists before any document is processed. The limit is printed at boot, and whenever adds are waiting the binary shows a live status line in the terminal:

[ingest] memory limit 1.0 GB above baseline (1.6 GB) · 2 concurrent — set SUPERMEMORY_EMBEDDING_RAM_LIMIT=ngb to change
[ingest] 2 running · 193 queued · 0.4 GB / 1.0 GB ingest memory
[ingest] 2 running · 193 queued · paused — 1.1 GB / 1.0 GB ingest memory, waiting for it to drop
[ingest] resumed — memory back under the 1.0 GB ingest limit
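The pause and resume behavior in those status lines boils down to comparing memory growth above the boot baseline with the limit. The function below is an illustrative sketch, not the server's implementation: `ingest_state` is a hypothetical name, values are whole megabytes for simplicity, and pausing only when usage is strictly above the limit is an assumption.

```shell
# Decide whether ingestion may proceed.
# Arguments: baseline_mb current_mb limit_mb (whole megabytes)
ingest_state() {
  local baseline_mb=$1 current_mb=$2 limit_mb=$3
  # Only growth above the post-boot baseline counts against the limit
  local ingest_mb=$(( current_mb - baseline_mb ))
  if [ "$ingest_mb" -gt "$limit_mb" ]; then
    echo "paused"
  else
    echo "running"
  fi
}

# Mirrors the sample log: 1.6 GB baseline, 1.0 GB limit
ingest_state 1600 2000 1000   # 0.4 GB of ingest memory
ingest_state 1600 2700 1000   # 1.1 GB of ingest memory
```

With the sample numbers, the first call reports `running` and the second reports `paused`, matching the second and third status lines.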

Variable	Purpose	Default
`SUPERMEMORY_EMBEDDING_RAM_LIMIT`	Memory ingestion may use above the boot baseline. Accepts `1gb`, `1.5gb`, `512mb`, or a bare number (GB).	`1gb`
`SUPERMEMORY_INGEST_CONCURRENCY`	Documents processed concurrently	`2`

# Give ingestion 4 GB of headroom on a larger machine
SUPERMEMORY_EMBEDDING_RAM_LIMIT=4gb ./supermemory-server

Raise the limit and concurrency on machines with spare RAM for faster bulk imports; lower them on small VPSes where you want the server to stay lean and don’t mind adds draining slowly.
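The accepted formats for `SUPERMEMORY_EMBEDDING_RAM_LIMIT` can be sanity-checked with a small converter. This is a sketch of one plausible reading, not the server's parser: `ram_limit_to_mb` is a hypothetical helper, it assumes 1 GB equals 1024 MB, and it does no validation of malformed input.

```shell
# Convert a limit such as 1gb, 1.5gb, 512mb, or a bare number (GB) to whole MB.
# Assumption: binary units, 1 GB = 1024 MB.
ram_limit_to_mb() {
  local value=$1 number factor
  case "$value" in
    *gb) number=${value%gb}; factor=1024 ;;
    *mb) number=${value%mb}; factor=1 ;;
    *)   number=$value;      factor=1024 ;;   # bare number means GB
  esac
  # awk handles the fractional case (for example 1.5gb)
  awk -v n="$number" -v f="$factor" 'BEGIN { printf "%d\n", n * f }'
}
```

For example, `ram_limit_to_mb 1.5gb` prints `1536` and `ram_limit_to_mb 512mb` prints `512`. That is handy for checking a value against the free memory on a small VPS before setting it.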

Telemetry

The self-hosted binary sends no analytics — there is nothing to opt out of. The only related switch:

Variable	Purpose	Default
`SUPERMEMORY_DISABLE_TELEMETRY`	Set to `1` to also disable internal AI SDK telemetry instrumentation	unset

Platform-only features

These exist in the codebase but are exclusive to the hosted platform — the self-hosted binary doesn’t include them:

Connectors — Google Drive, Notion, Gmail, OneDrive background sync
Supermemory MCP — managed MCP server endpoints
Optimized memory extraction — the platform’s extraction pipeline is tuned for higher quality at lower cost than bring-your-own-key
Managed scale — globally distributed infrastructure, no capacity planning

Any other environment variables you may find referenced in the codebase are platform-only: the self-hosted binary ignores them even when set.

Example: production-ish `.env`

# Persistent data location
SUPERMEMORY_DATA_DIR=/var/lib/supermemory

# One LLM provider (required for extraction)
OPENAI_API_KEY=sk-...

# Optional — omit to keep local Xenova/bge-base-en-v1.5 (768d)
# SUPERMEMORY_EMBEDDING_PROVIDER=openai
# SUPERMEMORY_EMBEDDING_MODEL=text-embedding-3-small
# SUPERMEMORY_EMBEDDING_DIMENSIONS=1536

That’s enough for full ingestion, memory extraction, and hybrid search with the default local embeddings.
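This page does not say whether the binary reads a `.env` file from the working directory on its own, so one cautious option is to export the file's contents into the shell before launching. A minimal sketch using the standard `set -a` idiom; `load_env_file` is a hypothetical helper, and it assumes the file contains only plain KEY=value lines and comments, because the file is interpreted as shell.

```shell
# Export every KEY=value line from an env file into the current shell.
# While `set -a` is active, each variable assigned is marked for export.
load_env_file() {
  set -a
  . "$1"
  set +a
}
```

Usage would be something like `load_env_file ./.env` followed by `./supermemory-server` in the same shell. Values containing spaces or shell metacharacters need quoting in the file for this approach to work.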
