embed

omnigraph embed is an offline JSONL pipeline that fills, refreshes, or strips embedding columns in a .jsonl data file. It operates on files, not on a live graph, so you can prepare data for load or ingest without round-tripping through the engine.

The compiler-side embedder (text-embedding-3-small by default) generates normalized vectors at query time. The engine-side embedder (gemini-embedding-2-preview) runs automatically at write time for @embed-annotated vector columns, so you only need omnigraph embed when you want to prebake embeddings outside the engine.

Modes

The three modes are mutually exclusive:

Flag	Behavior
(none; default `fill_missing`)	Embed only rows whose target field is empty
`--reembed-all`	Recompute every row's embedding, overwriting existing values
`--clean`	Strip embedding columns from every row
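
The three modes above can be pictured as a per-row transformation. The following is a minimal Python sketch, not omnigraph's implementation: the function name `apply_mode`, the mode strings `reembed_all` and `clean`, and the `embed` callback are hypothetical, and treating `None` or an empty list as "empty" is an assumption.

```python
from typing import Callable, List


def apply_mode(row: dict, target: str, mode: str,
               embed: Callable[[dict], List[float]]) -> dict:
    """Return a copy of `row` processed under one of the three modes."""
    out = dict(row)
    if mode == "clean":
        # Strip the embedding column if it is present.
        out.pop(target, None)
    elif mode == "reembed_all":
        # Always recompute, overwriting any existing value.
        out[target] = embed(row)
    elif mode == "fill_missing":
        # Assumption: a missing key, None, or an empty list counts as "empty".
        if not out.get(target):
            out[target] = embed(row)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out


def fake_embed(row: dict) -> List[float]:
    """Stand-in embedder so the sketch runs without any API call."""
    return [0.0, 1.0]
```

With this sketch, a row that already carries an embedding passes through `fill_missing` unchanged, is overwritten under `reembed_all`, and loses the column under `clean`.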

Usage

Two mutually exclusive input shapes are supported:

# Driven by a seed manifest YAML (defines sources, artifacts, and the embed spec inline)
omnigraph embed --seed ./seed.yaml [--reembed-all|--clean]

# Direct: explicit input / output JSONL plus a standalone embed spec JSON file
omnigraph embed --input data.jsonl --output data.embedded.jsonl --spec embed-spec.json

--seed is incompatible with --input / --output / --spec; pick one input shape per invocation.
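
The input-shape rule above can be sketched as a small validator. This is an illustrative Python sketch, not omnigraph's own argument handling: the function name `check_input_shape`, its parameter names, and its error messages are hypothetical.

```python
from typing import Optional


def check_input_shape(seed: Optional[str] = None,
                      input_path: Optional[str] = None,
                      output_path: Optional[str] = None,
                      spec_path: Optional[str] = None) -> str:
    """Return "seed" or "direct", or raise ValueError for an invalid mix."""
    direct = {"--input": input_path, "--output": output_path, "--spec": spec_path}
    given = [flag for flag, value in direct.items() if value is not None]

    if seed is not None:
        # --seed must stand alone.
        if given:
            raise ValueError(f"--seed cannot be combined with {', '.join(given)}")
        return "seed"

    # Without --seed, all three direct flags are needed together.
    if len(given) != len(direct):
        missing = [flag for flag in direct if flag not in given]
        raise ValueError(f"direct mode also needs {', '.join(missing)}")
    return "direct"
```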

Options

Option	Required	Description
`--input`	with `--output` + `--spec`	Source JSONL file
`--output`	with `--input` + `--spec`	Destination JSONL file
`--spec`	with `--input` + `--output`	Path to an embed-spec JSON file (see below)
`--seed`	alternative to the trio above	Path to a seed manifest YAML describing inputs and embed specs inline
`--type`	no	Repeatable. Embed only rows of the given node / edge type.
`--select`	no	Repeatable filter, `T:field=value` or `field=value`.
`--reembed-all`	no	Overwrite existing embeddings
`--clean`	no	Strip embedding columns instead of writing new ones
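
To make the two `--select` forms concrete, here is a hedged Python sketch of how such an expression could be split into its parts. It is not omnigraph's parser: `Selector`, `parse_select`, and `matches` are hypothetical names, string comparison of values is an assumption, and the row's type is passed in separately because the source does not say how a type is stored in a JSONL row. How multiple `--select` flags combine is not covered here.

```python
from typing import NamedTuple, Optional


class Selector(NamedTuple):
    type_name: Optional[str]  # None for the bare `field=value` form
    field: str
    value: str


def parse_select(expr: str) -> Selector:
    """Split `T:field=value` or `field=value` at the first '='."""
    lhs, sep, value = expr.partition("=")
    if not sep:
        raise ValueError(f"expected field=value, got {expr!r}")
    if ":" in lhs:
        type_name, _, field = lhs.partition(":")
    else:
        type_name, field = None, lhs
    if not field or type_name == "":
        raise ValueError(f"empty type or field in {expr!r}")
    return Selector(type_name, field, value)


def matches(sel: Selector, row_type: str, row: dict) -> bool:
    """Assumption: values are compared as strings."""
    if sel.type_name is not None and sel.type_name != row_type:
        return False
    return sel.field in row and str(row[sel.field]) == sel.value
```

Splitting at the first `=` keeps values that themselves contain `=` or `:` (for example URLs) intact.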

Embed-spec JSON shape

The spec file names the embedding model and vector dimension, and maps each entity type to its target embedding column and the source fields whose text is embedded:

{
  "model": "gemini-embedding-2-preview",
  "dimension": 1536,
  "types": {
    "Document": { "target": "embedding", "fields": ["body"] },
    "Person":   { "target": "embedding", "fields": ["bio"] }
  }
}

model is optional and defaults to gemini-embedding-2-preview. The same shape can live under embeddings: inside a seed manifest YAML.
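
A spec of this shape can be checked before running the command. The sketch below is a minimal Python validator written against the example above, not omnigraph's own loader: `load_spec` is a hypothetical helper, and treating `dimension`, `target`, and `fields` as required is an assumption (the source only states that `model` is optional).

```python
import json

DEFAULT_MODEL = "gemini-embedding-2-preview"  # documented default for `model`


def load_spec(text: str) -> dict:
    """Parse an embed-spec JSON string, apply the default model, and validate."""
    spec = json.loads(text)
    if not isinstance(spec, dict):
        raise ValueError("spec must be a JSON object")
    spec.setdefault("model", DEFAULT_MODEL)

    dim = spec.get("dimension")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(dim, bool) or not isinstance(dim, int) or dim <= 0:
        raise ValueError("dimension must be a positive integer")

    types = spec.get("types")
    if not isinstance(types, dict) or not types:
        raise ValueError("types must be a non-empty object")
    for name, entry in types.items():
        if not isinstance(entry, dict):
            raise ValueError(f"{name}: entry must be an object")
        if not isinstance(entry.get("target"), str) or not entry["target"]:
            raise ValueError(f"{name}: target must be a non-empty string")
        fields = entry.get("fields")
        if (not isinstance(fields, list) or not fields
                or not all(isinstance(f, str) for f in fields)):
            raise ValueError(f"{name}: fields must be a non-empty list of strings")
    return spec
```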

Environment

The compiler-side embedder reads the following environment variables:

Variable	Default	Purpose
`NANOGRAPH_EMBED_MODEL`	`text-embedding-3-small`	Model identifier
`OPENAI_API_KEY`	—	API credentials
`OPENAI_BASE_URL`	`https://api.openai.com/v1`	API endpoint
`NANOGRAPH_EMBED_TIMEOUT_MS`	`30000`	Per-request timeout
`NANOGRAPH_EMBED_RETRY_ATTEMPTS`	`4`	Retry budget for transient errors
`NANOGRAPH_EMBEDDINGS_MOCK`	unset	Deterministic mock embedder for tests
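
As an illustration of how the defaults in the table resolve, the following Python sketch reads the same variables from a mapping. It is not the embedder's actual configuration code: `EmbedEnv` and `read_embed_env` are hypothetical names, and treating any non-empty `NANOGRAPH_EMBEDDINGS_MOCK` value as "mock enabled" is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EmbedEnv:
    model: str
    api_key: Optional[str]
    base_url: str
    timeout_ms: int
    retry_attempts: int
    mock: bool


def read_embed_env(env: Optional[Mapping[str, str]] = None) -> EmbedEnv:
    """Resolve embedder settings, falling back to the documented defaults."""
    if env is None:
        env = os.environ
    return EmbedEnv(
        model=env.get("NANOGRAPH_EMBED_MODEL", "text-embedding-3-small"),
        api_key=env.get("OPENAI_API_KEY"),  # no default; None when unset
        base_url=env.get("OPENAI_BASE_URL", "https://api.openai.com/v1"),
        timeout_ms=int(env.get("NANOGRAPH_EMBED_TIMEOUT_MS", "30000")),
        retry_attempts=int(env.get("NANOGRAPH_EMBED_RETRY_ATTEMPTS", "4")),
        # Assumption: any non-empty value switches the mock embedder on.
        mock=bool(env.get("NANOGRAPH_EMBEDDINGS_MOCK")),
    )
```

Passing an explicit mapping instead of relying on `os.environ` keeps the sketch easy to exercise in tests.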

Example

Bake embeddings into a seed file before initial load:

omnigraph embed --input ./seed.jsonl \
    --output ./seed.embedded.jsonl \
    --spec ./embed-spec.json \
    --type Document

omnigraph init --schema ./schema.pg ./graph.omni
omnigraph load ./graph.omni --data ./seed.embedded.jsonl --mode overwrite

Refresh every existing embedding (e.g. after switching models):

omnigraph embed --input ./data.jsonl \
    --output ./data.embedded.jsonl \
    --spec ./embed-spec.json \
    --reembed-all