Embedding Engine

Ghost uses a fallback chain architecture for generating text embeddings used in semantic search.

┌─────────────────────────────┐
│ Native Candle Engine        │  Priority 1: In-process
│ all-MiniLM-L6-v2 (384D)     │  Zero external dependencies
├─────────────────────────────┤
│ Ollama Engine               │  Priority 2: HTTP fallback
│ nomic-embed-text (768D)     │  Higher quality, needs Ollama
├─────────────────────────────┤
│ FTS5 Only                   │  Priority 3: Degraded
│ Keyword search only         │  Always works, no vectors
└─────────────────────────────┘
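The chain above can be sketched as a simple priority check. This is an illustrative sketch, not Ghost's actual API; the `ActiveEngine` enum and `select_engine` function are hypothetical names:

```rust
// Hypothetical sketch of the fallback chain: try each engine in priority
// order and degrade to keyword-only search if none initializes.
#[derive(Debug, PartialEq)]
enum ActiveEngine {
    Native,   // in-process Candle, 384-dim
    Ollama,   // HTTP sidecar, 768-dim
    Fts5Only, // no vectors, keyword search only
}

fn select_engine(native_ok: bool, ollama_ok: bool) -> ActiveEngine {
    if native_ok {
        ActiveEngine::Native
    } else if ollama_ok {
        ActiveEngine::Ollama
    } else {
        ActiveEngine::Fts5Only
    }
}

fn main() {
    // Native available: use it regardless of Ollama.
    assert_eq!(select_engine(true, true), ActiveEngine::Native);
    // Native failed (e.g. interrupted download): fall back to Ollama.
    assert_eq!(select_engine(false, true), ActiveEngine::Ollama);
    // Nothing available: degrade to FTS5 keyword search.
    assert_eq!(select_engine(false, false), ActiveEngine::Fts5Only);
}
```

The key property is that every tier is strictly optional except the last: FTS5 keyword search always works, so search never fails entirely.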

Model: all-MiniLM-L6-v2

Property          Value
Dimensions        384
Model size        ~23MB (safetensors)
Framework         Candle (pure Rust)
Tokenizer         HuggingFace tokenizers
Platform          Any CPU (x86_64, ARM64)
SIMD              AVX2 (Intel/AMD), NEON (ARM)
First load        Downloads from HuggingFace Hub
Subsequent loads  <200ms from cache
The embedding pipeline:

  1. Text is tokenized using the HuggingFace tokenizers crate
  2. Tokens pass through the BERT model layers via Candle
  3. Mean pooling over the token embeddings produces a 384-dimensional vector
  4. The vector is stored in sqlite-vec for KNN search
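Step 3, mean pooling, is just an element-wise average of the per-token embedding vectors. A minimal sketch (simplified: the real pipeline also masks padding tokens before averaging, and operates on Candle tensors rather than `Vec<f32>`):

```rust
// Mean pooling: average the per-token embeddings into a single vector.
// Assumes a non-empty input where every token vector has the same length
// (seq_len x 384 in the real model; tiny dimensions here for brevity).
fn mean_pool(token_embeddings: &[Vec<f32>]) -> Vec<f32> {
    let dims = token_embeddings[0].len();
    let mut pooled = vec![0.0f32; dims];
    for tok in token_embeddings {
        for (i, v) in tok.iter().enumerate() {
            pooled[i] += v;
        }
    }
    let n = token_embeddings.len() as f32;
    for v in pooled.iter_mut() {
        *v /= n;
    }
    pooled
}

fn main() {
    let toks = vec![vec![1.0, 2.0, 3.0], vec![3.0, 4.0, 5.0]];
    assert_eq!(mean_pool(&toks), vec![2.0, 3.0, 4.0]);
}
```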

Ghost automatically detects the host hardware and tunes itself accordingly:

pub struct HardwareInfo {
    pub cpu_cores: usize,
    pub has_avx2: bool,                  // x86 SIMD
    pub has_neon: bool,                  // ARM SIMD
    pub gpu_backend: Option<GpuBackend>, // CUDA/Metal/Vulkan
    pub total_ram_mb: u64,
}
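How such a struct might feed tuning decisions can be sketched as follows. The helper functions (`inference_threads`, `simd_label`) are illustrative assumptions, not Ghost's API, and the struct is reproduced here without the `gpu_backend` field to keep the sketch self-contained:

```rust
// Hypothetical sketch: derive runtime settings from detected hardware.
#[allow(dead_code)]
struct HardwareInfo {
    cpu_cores: usize,
    has_avx2: bool, // x86 SIMD
    has_neon: bool, // ARM SIMD
    total_ram_mb: u64,
}

// Leave one core of headroom; cap to avoid oversubscription.
fn inference_threads(hw: &HardwareInfo) -> usize {
    hw.cpu_cores.saturating_sub(1).clamp(1, 8)
}

// Pick the widest SIMD path the CPU supports, else scalar code.
fn simd_label(hw: &HardwareInfo) -> &'static str {
    if hw.has_avx2 {
        "avx2"
    } else if hw.has_neon {
        "neon"
    } else {
        "scalar"
    }
}

fn main() {
    let hw = HardwareInfo { cpu_cores: 8, has_avx2: true, has_neon: false, total_ram_mb: 16384 };
    assert_eq!(inference_threads(&hw), 7);
    assert_eq!(simd_label(&hw), "avx2");
}
```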

If the native engine fails (e.g., model download interrupted), Ghost falls back to Ollama:

  • Model: nomic-embed-text (768 dimensions)
  • API: POST http://localhost:11434/api/embeddings
  • Quality: higher than MiniLM, but requires a running Ollama server
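For reference, the request body for that endpoint pairs a model name with the input text. A dependency-free sketch of building it by hand (a real client would use a JSON serializer such as serde; the escaping here covers only quotes and backslashes):

```rust
// Build a JSON payload for Ollama's /api/embeddings endpoint.
// Sketch only: real code should use a proper JSON serializer.
fn ollama_embed_request(model: &str, text: &str) -> String {
    let esc = |s: &str| s.replace('\\', "\\\\").replace('"', "\\\"");
    format!(r#"{{"model":"{}","prompt":"{}"}}"#, esc(model), esc(text))
}

fn main() {
    let body = ollama_embed_request("nomic-embed-text", "hello world");
    assert_eq!(body, r#"{"model":"nomic-embed-text","prompt":"hello world"}"#);
}
```

The response contains a single `embedding` array of 768 floats, which is stored the same way as native vectors.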

The database adapts to whichever engine is active:

Engine   Dimensions   sqlite-vec table
Native   384          FLOAT[384]
Ollama   768          FLOAT[768]

When switching engines, all stored vectors are re-generated so that every embedding in the table matches the active engine's dimensionality.
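The dimension-dependent schema can be sketched as a small DDL generator. The table and column names here are illustrative, though the `vec0(… FLOAT[n])` form follows sqlite-vec's documented virtual-table syntax:

```rust
// Sketch: emit the sqlite-vec virtual-table DDL for the active engine.
// Table/column names are hypothetical, not Ghost's actual schema.
fn vec_table_ddl(dims: usize) -> String {
    format!(
        "CREATE VIRTUAL TABLE embeddings USING vec0(embedding FLOAT[{}])",
        dims
    )
}

fn main() {
    // Native engine: 384-dim; Ollama: 768-dim.
    assert!(vec_table_ddl(384).contains("FLOAT[384]"));
    assert!(vec_table_ddl(768).contains("FLOAT[768]"));
}
```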

Before embedding, documents are split into chunks:

  • Chunk size: 512 tokens
  • Overlap: 64 tokens
  • Strategy: Sentence-boundary aware splitting

This ensures each vector represents a coherent semantic unit while maintaining context across boundaries.
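The chunking strategy can be sketched as follows. This is a simplified illustration, not Ghost's implementation: whitespace-separated words stand in for tokenizer tokens, sentences are assumed to be pre-split, and chunks may slightly exceed the limit when the overlap is carried over:

```rust
// Sketch of sentence-boundary-aware chunking with overlap.
// `max_tokens` ~ 512 and `overlap` ~ 64 in the real pipeline;
// word counts approximate token counts here.
fn chunk_sentences(sentences: &[&str], max_tokens: usize, overlap: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current: Vec<&str> = Vec::new();
    let mut count = 0;
    for &s in sentences {
        let n = s.split_whitespace().count();
        if count + n > max_tokens && !current.is_empty() {
            chunks.push(current.join(" "));
            // Carry trailing sentences forward until ~`overlap` tokens,
            // so adjacent chunks share context across the boundary.
            let mut carried = Vec::new();
            let mut c = 0;
            for &prev in current.iter().rev() {
                c += prev.split_whitespace().count();
                carried.push(prev);
                if c >= overlap {
                    break;
                }
            }
            carried.reverse();
            current = carried;
            count = current.iter().map(|p| p.split_whitespace().count()).sum();
        }
        current.push(s);
        count += n;
    }
    if !current.is_empty() {
        chunks.push(current.join(" "));
    }
    chunks
}

fn main() {
    let s = ["one two three", "four five six", "seven eight nine"];
    let chunks = chunk_sentences(&s, 5, 2);
    // Three chunks, each overlapping its neighbor by one sentence.
    assert_eq!(chunks.len(), 3);
    assert_eq!(chunks[1], "one two three four five six");
}
```

Because splits land only on sentence boundaries, no vector ever embeds half a sentence, and the overlap keeps pronouns and references near a boundary resolvable in at least one chunk.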