embed
Generate, refresh, or clean embeddings in a JSONL file.
omnigraph embed is an offline JSONL pipeline that fills, refreshes, or
strips embedding columns in a .jsonl data file. It operates on files, not
on a live graph, so you can prepare data for load / ingest without
round-tripping through the engine.
The compiler-side embedder (text-embedding-3-small by default) generates
query-time normalized vectors. The engine-side embedder
(gemini-embedding-2-preview) runs automatically at write time for
@embed-annotated vector columns; you only need omnigraph embed when you
want to prebake embeddings outside the engine.
Modes
The three modes are mutually exclusive:
| Flag | Behavior |
|---|---|
| (default) fill_missing | Only embed rows whose target field is empty |
| --reembed-all | Recompute every row's embedding, overwriting existing values |
| --clean | Strip embedding columns from every row |
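The row-level effect of the three modes can be sketched in a few lines of Python. This is an illustration of the behavior described in the table, not omnigraph's implementation: `apply_mode`, `fake_embed`, the flat row layout, and the rule that a missing, null, or empty-list value counts as "empty" are all assumptions made for the example.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
MODES = ("fill_missing", "reembed_all", "clean")


def apply_mode(rows: List[Row], target: str, fields: List[str],
               embed: Callable[[str], List[float]],
               mode: str = "fill_missing") -> List[Row]:
    """Return copies of rows with the target column filled, recomputed, or stripped."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    out = []
    for row in rows:
        new_row = dict(row)
        if mode == "clean":
            # Strip the embedding column whether or not it is present.
            new_row.pop(target, None)
        elif mode == "reembed_all" or not new_row.get(target):
            # Assumption: the source fields are joined with newlines before embedding.
            text = "\n".join(str(new_row.get(f, "")) for f in fields)
            new_row[target] = embed(text)
        out.append(new_row)
    return out


def fake_embed(text: str) -> List[float]:
    # Hypothetical stand-in for a real embedding call: a 1-dimensional "vector".
    return [float(len(text))]


rows = [
    {"id": "a", "body": "hello", "embedding": [9.0]},
    {"id": "b", "body": "hi"},
]
filled = apply_mode(rows, "embedding", ["body"], fake_embed)
refreshed = apply_mode(rows, "embedding", ["body"], fake_embed, mode="reembed_all")
cleaned = apply_mode(rows, "embedding", ["body"], fake_embed, mode="clean")
```

In this sketch the default mode leaves row `a` untouched and only embeds row `b`, while `reembed_all` overwrites both and `clean` removes the column from both.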
Usage
Two input shapes are supported. They are mutually exclusive:
# Driven by a seed manifest YAML (defines sources, artifacts, and the embed spec inline)
omnigraph embed --seed ./seed.yaml [--reembed-all|--clean]
# Direct: explicit input / output JSONL plus a standalone embed spec JSON file
omnigraph embed --input data.jsonl --output data.embedded.jsonl --spec embed-spec.json
--seed is incompatible with --input / --output / --spec; pick one
input shape per invocation.
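A wrapper script that builds omnigraph embed invocations could enforce the same rule before calling the CLI. The following is a minimal sketch; the function name `input_shape` and the error messages are hypothetical and do not reflect the tool's actual diagnostics.

```python
from typing import Optional


def input_shape(seed: Optional[str] = None, input_path: Optional[str] = None,
                output: Optional[str] = None, spec: Optional[str] = None) -> str:
    """Return "seed" or "direct", or raise if the flags mix or are incomplete."""
    direct = [input_path, output, spec]
    if seed is not None:
        if any(value is not None for value in direct):
            raise ValueError("--seed cannot be combined with --input / --output / --spec")
        return "seed"
    if all(value is not None for value in direct):
        return "direct"
    raise ValueError("pass either --seed, or all of --input, --output, and --spec")
```

For example, `input_shape(seed="seed.yaml")` returns `"seed"`, while passing `seed` together with `spec` raises `ValueError`.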
Options
| Option | Required | Description |
|---|---|---|
| --input | with --output + --spec | Source JSONL file |
| --output | with --input + --spec | Destination JSONL file |
| --spec | with --input + --output | Path to an embed-spec JSON file (see below) |
| --seed | alternative to the trio above | Path to a seed manifest YAML describing inputs and embed specs inline |
| --type | no | Repeatable. Embed only rows of the given node / edge type. |
| --select | no | Repeatable filter, T:field=value or field=value. |
| --reembed-all | no | Overwrite existing embeddings |
| --clean | no | Strip embedding columns instead of writing new ones |
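The two --select forms, T:field=value and field=value, can be told apart with a small parser. The sketch below is one plausible reading of that syntax, not the tool's own parser: it assumes the first `=` separates the field from the value, that the first `:` before it separates the type from the field, and that values are compared as plain strings. `Select` and `parse_select` are names invented for the example.

```python
from typing import NamedTuple, Optional


class Select(NamedTuple):
    type_name: Optional[str]  # None means the filter applies to every type
    field: str
    value: str


def parse_select(expr: str) -> Select:
    """Parse "T:field=value" or "field=value" into its parts."""
    left, sep, value = expr.partition("=")
    if not sep or not left:
        raise ValueError(f"expected T:field=value or field=value, got {expr!r}")
    if ":" in left:
        type_name, _, field = left.partition(":")
        if not type_name or not field:
            raise ValueError(f"empty type or field in {expr!r}")
        return Select(type_name, field, value)
    return Select(None, left, value)
```

Under these assumptions, `parse_select("Document:lang=en")` yields a filter scoped to `Document`, and `parse_select("lang=en")` yields an unscoped one.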
Embed-spec JSON shape
The spec file describes the embedding model, dimension, and the mapping of each entity type to its target embedding column and the source fields whose text gets embedded:
{
  "model": "gemini-embedding-2-preview",
  "dimension": 1536,
  "types": {
    "Document": { "target": "embedding", "fields": ["body"] },
    "Person": { "target": "embedding", "fields": ["bio"] }
  }
}
model is optional and defaults to gemini-embedding-2-preview. The same
shape can live under embeddings: inside a seed manifest YAML.
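Before running a long embedding job, it can help to sanity-check a spec file. The following sketch validates the shape shown above and applies the documented default for model. Treating dimension as a required positive integer and requiring at least one entry under types are assumptions of this example; `load_spec` is a hypothetical helper, not part of omnigraph.

```python
import json

DEFAULT_MODEL = "gemini-embedding-2-preview"


def load_spec(text: str) -> dict:
    """Parse an embed-spec JSON string, fill the default model, and check its shape."""
    spec = json.loads(text)
    spec.setdefault("model", DEFAULT_MODEL)

    dim = spec.get("dimension")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(dim, int) or isinstance(dim, bool) or dim <= 0:
        raise ValueError("dimension must be a positive integer")

    types = spec.get("types")
    if not isinstance(types, dict) or not types:
        raise ValueError("types must map at least one type name to an entry")

    for name, entry in types.items():
        if not isinstance(entry, dict) or not isinstance(entry.get("target"), str):
            raise ValueError(f"{name}: target must be a column name")
        fields = entry.get("fields")
        if (not isinstance(fields, list) or not fields
                or not all(isinstance(f, str) for f in fields)):
            raise ValueError(f"{name}: fields must be a non-empty list of field names")
    return spec


spec = load_spec('{"dimension": 1536, "types": '
                 '{"Document": {"target": "embedding", "fields": ["body"]}}}')
```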
Environment
The compiler-side embedder reads the following environment variables:
| Var | Default | Purpose |
|---|---|---|
| NANOGRAPH_EMBED_MODEL | text-embedding-3-small | Model identifier |
| OPENAI_API_KEY | (none) | API credentials |
| OPENAI_BASE_URL | https://api.openai.com/v1 | API endpoint |
| NANOGRAPH_EMBED_TIMEOUT_MS | 30000 | Per-request timeout |
| NANOGRAPH_EMBED_RETRY_ATTEMPTS | 4 | Retry budget for transient errors |
| NANOGRAPH_EMBEDDINGS_MOCK | unset | Deterministic mock embedder for tests |
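As a rough illustration of how these variables and defaults combine, the sketch below resolves them into a single configuration object. `EmbedConfig` and `load_config` are hypothetical names, and treating any non-empty NANOGRAPH_EMBEDDINGS_MOCK value as "mock enabled" is an assumption; the embedder itself may interpret the variable differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EmbedConfig:
    model: str
    api_key: Optional[str]
    base_url: str
    timeout_ms: int
    retry_attempts: int
    mock: bool


def load_config(env: Mapping[str, str] = os.environ) -> EmbedConfig:
    """Resolve embedder settings from the environment, using the table's defaults."""
    return EmbedConfig(
        model=env.get("NANOGRAPH_EMBED_MODEL", "text-embedding-3-small"),
        api_key=env.get("OPENAI_API_KEY"),  # no default: credentials must be supplied
        base_url=env.get("OPENAI_BASE_URL", "https://api.openai.com/v1"),
        timeout_ms=int(env.get("NANOGRAPH_EMBED_TIMEOUT_MS", "30000")),
        retry_attempts=int(env.get("NANOGRAPH_EMBED_RETRY_ATTEMPTS", "4")),
        # Assumption: any non-empty value switches on the mock embedder.
        mock=bool(env.get("NANOGRAPH_EMBEDDINGS_MOCK")),
    )


defaults = load_config({})
overridden = load_config({"NANOGRAPH_EMBED_TIMEOUT_MS": "5000",
                          "NANOGRAPH_EMBEDDINGS_MOCK": "1"})
```

Passing a plain dict instead of `os.environ` keeps the function easy to exercise in tests.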
Example
Bake embeddings into a seed file before initial load:
omnigraph embed --input ./seed.jsonl \
  --output ./seed.embedded.jsonl \
  --spec ./embed-spec.json \
  --type Document
omnigraph init --schema ./schema.pg ./graph.omni
omnigraph load ./graph.omni --data ./seed.embedded.jsonl --mode overwrite
Refresh every existing embedding (e.g. after switching models):
omnigraph embed --input ./data.jsonl \
  --output ./data.embedded.jsonl \
  --spec ./embed-spec.json \
  --reembed-all
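After a run, a quick check of the output file can confirm that every row carries a vector of the expected length before the file is loaded. The sketch below assumes each JSONL line is a flat JSON object with the embedding stored under a top-level key; the real row layout may differ, so adjust the lookup accordingly. `check_embeddings` is a hypothetical helper, and the inline sample stands in for lines read from a file such as data.embedded.jsonl.

```python
import json
from typing import Dict, Iterable


def check_embeddings(lines: Iterable[str], target: str, dimension: int) -> Dict[str, int]:
    """Count rows whose target column is present, missing, or the wrong length."""
    counts = {"ok": 0, "missing": 0, "wrong_dimension": 0}
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        vector = row.get(target)
        if not vector:
            counts["missing"] += 1
        elif len(vector) != dimension:
            counts["wrong_dimension"] += 1
        else:
            counts["ok"] += 1
    return counts


sample = [
    '{"id": "a", "embedding": [0.1, 0.2, 0.3]}',
    '{"id": "b"}',
    '{"id": "c", "embedding": [0.1]}',
]
report = check_embeddings(sample, "embedding", 3)
```

To check a real file, pass an open file handle as `lines`, since iterating a text file yields one line at a time.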