Native AI Engine

Ghost runs AI models natively in-process using Candle, HuggingFace’s Rust ML framework. No Ollama, no Python, no GPU required.

The embedding model:

  • Dimensions: 384
  • Model size: ~23MB (safetensors)
  • Inference: In-process via Candle — zero HTTP overhead
  • Platform: Any CPU (x86_64 or ARM64)
  • SIMD: Auto-detects AVX2 (Intel/AMD) or NEON (ARM)

// How Ghost generates embeddings natively
let engine = NativeEngine::load().await?;
let embedding: Vec<f32> = engine.embed("search query")?; // 384-dim

The model downloads once from HuggingFace Hub (~23MB) and is cached locally. Subsequent loads take <200ms.
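
As a rough sketch, this download-once/cache pattern maps onto the hf-hub crate; the repo id and function below are placeholders, not Ghost's actual model or code:

// Minimal sketch: fetch weights once, hit the local cache afterwards.
// "example-org/embedding-model" is a placeholder repo id.
use hf_hub::api::tokio::Api;

async fn fetch_weights() -> anyhow::Result<std::path::PathBuf> {
    let api = Api::new()?; // cache lives under ~/.cache/huggingface/hub
    let repo = api.model("example-org/embedding-model".to_string());
    // First call downloads ~23MB; subsequent calls return the cached path
    Ok(repo.get("model.safetensors").await?)
}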

Ghost auto-selects the largest chat model that fits in your RAM:

RAM       Model                   Parameters   VRAM/RAM
2-4 GB    Qwen2.5-0.5B-Instruct   0.5B         ~0.5 GB
4-8 GB    Qwen2.5-1.5B-Instruct   1.5B         ~1.2 GB
8-16 GB   Qwen2.5-3B-Instruct     3B           ~2.2 GB
16+ GB    Qwen2.5-7B-Instruct     7B           ~4.5 GB

All models use Q4_K_M quantization for efficient inference on consumer hardware.
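
As a sketch, tier selection reduces to a threshold match on total RAM. The sysinfo crate and the pick_chat_model function here are illustrative, not Ghost's actual code:

// Illustrative tier selection; thresholds mirror the table above.
// Note: sysinfo's total_memory() returns bytes in recent versions.
use sysinfo::System;

fn pick_chat_model() -> &'static str {
    let mut sys = System::new();
    sys.refresh_memory();
    let ram_gb = sys.total_memory() as f64 / 1e9;
    match ram_gb {
        r if r >= 16.0 => "Qwen2.5-7B-Instruct",
        r if r >= 8.0 => "Qwen2.5-3B-Instruct",
        r if r >= 4.0 => "Qwen2.5-1.5B-Instruct",
        _ => "Qwen2.5-0.5B-Instruct",
    }
}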

On startup, Ghost detects:

  • CPU cores — for parallel chunk processing
  • RAM — for automatic model tier selection
  • SIMD — AVX2 (Intel/AMD) or NEON (ARM) for faster tensor operations
  • GPU — CUDA (NVIDIA), Metal (Apple), Vulkan (cross-platform)

This information is displayed in the onboarding wizard and status bar.
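
A rough sketch of such a probe; the candle_core::utils helpers are real Candle APIs, the rest is illustrative, and Vulkan probing is omitted:

// Hypothetical startup probe; detect_hardware is illustrative.
use std::thread;

fn detect_hardware() {
    // CPU cores -> how many chunks to process in parallel
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);

    // SIMD: AVX2 must be probed at runtime on x86_64; NEON is baseline on AArch64
    #[cfg(target_arch = "x86_64")]
    let simd = if is_x86_feature_detected!("avx2") { "AVX2" } else { "scalar" };
    #[cfg(target_arch = "aarch64")]
    let simd = "NEON";
    #[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
    let simd = "scalar";

    // GPU backends compiled into Candle
    let cuda = candle_core::utils::cuda_is_available();
    let metal = candle_core::utils::metal_is_available();

    println!("cores={cores} simd={simd} cuda={cuda} metal={metal}");
}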

Ghost gracefully degrades based on available resources:

Native Candle (in-process, zero deps)
↓ if native fails
Ollama HTTP API (localhost:11434)
↓ if Ollama unavailable
FTS5 keyword-only search (always works)

This ensures Ghost always works — even on low-end hardware with no internet connection.
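
A minimal sketch of that chain, reusing the NativeEngine type from the snippet above and probing Ollama via its real /api/tags endpoint; the reqwest crate, Backend enum, and pick_backend function are assumptions:

// Illustrative fallback selection, not Ghost's actual code.
enum Backend {
    Native,  // in-process Candle
    Ollama,  // localhost HTTP API
    Keyword, // SQLite FTS5 only, no model needed
}

async fn pick_backend() -> Backend {
    if NativeEngine::load().await.is_ok() {
        Backend::Native
    } else if reqwest::get("http://localhost:11434/api/tags").await.is_ok() {
        Backend::Ollama
    } else {
        Backend::Keyword
    }
}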