
Chat Engine

Ghost includes a full chat engine powered by local LLMs — no cloud APIs, no subscriptions.

  1. Hardware Detection: Ghost scans your CPU, RAM, and GPU at startup
  2. Model Selection: Ghost automatically picks the largest model that fits comfortably in available memory
  3. Background Download: The model downloads from the Hugging Face Hub in the background
  4. Native Inference: Inference runs via Candle with GGUF weights (desktop), with Ollama as a fallback

| Tier   | Model                        | Size    | RAM Required |
|--------|------------------------------|---------|--------------|
| Tiny   | Qwen2.5-0.5B-Instruct-Q4_K_M | ~400 MB | 2 GB         |
| Small  | Qwen2.5-1.5B-Instruct-Q4_K_M | ~1.1 GB | 4 GB         |
| Medium | Qwen2.5-3B-Instruct-Q4_K_M   | ~2.0 GB | 8 GB         |
| Large  | Qwen2.5-7B-Instruct-Q4_K_M   | ~4.3 GB | 16 GB        |

For the ReAct agent loop with tool calling, Ghost uses Qwen3 via Ollama:

| Tier   | Model      | RAM Required |
|--------|------------|--------------|
| Micro  | Qwen3-0.6B | 2 GB         |
| Tiny   | Qwen3-1.7B | 4 GB         |
| Small  | Qwen3-4B   | 6 GB         |
| Medium | Qwen3-8B   | 10 GB        |
| Large  | Qwen3-14B  | 18 GB        |
| XL     | Qwen3-32B  | 36 GB        |
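
A ReAct-style agent loop alternates model reasoning with tool calls until the model emits a final answer. The skeleton below stands in for that loop with a stubbed model and a toy tool; Ghost's real loop talks to Qwen3 through Ollama, and every name here is illustrative.

```python
# Minimal ReAct-style loop sketch: at each step the model either requests
# a tool call or returns a final answer. The stub model and calculator
# tool are illustrative, not Ghost's actual implementation.
def react_loop(model, tools, question, max_steps=5):
    history = [("question", question)]
    for _ in range(max_steps):
        action = model(history)             # "reason" over the transcript
        if action["type"] == "final":
            return action["answer"]
        tool = tools[action["tool"]]        # dispatch the requested tool
        observation = tool(action["input"])
        history.append(("observation", observation))
    return None  # gave up after max_steps

# Stub model: asks for the calculator once, then answers with the result.
def stub_model(history):
    if history[-1][0] == "question":
        return {"type": "tool", "tool": "calc", "input": "6*7"}
    return {"type": "final", "answer": history[-1][1]}

tools = {"calc": lambda expr: eval(expr)}  # toy tool; never eval untrusted input
```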

  • Unified Omnibox: Type naturally; Ghost auto-detects chat intent
  • Streaming responses: Token-by-token output via AG-UI events
  • Conversation memory: Persisted in SQLite with FTS5 search across past conversations
  • Debug panel: See reasoning, tool calls, and timing with Ctrl+D
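
The conversation-memory feature can be illustrated with SQLite's FTS5 extension, which most Python builds ship by default. The table name and schema below are assumptions for the sketch, not Ghost's actual schema.

```python
import sqlite3

# Sketch of FTS5-backed conversation search. The schema is illustrative;
# Ghost's real tables may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE messages USING fts5(role, content)")
conn.executemany(
    "INSERT INTO messages (role, content) VALUES (?, ?)",
    [
        ("user", "How do I stream tokens over AG-UI events?"),
        ("assistant", "Subscribe to the event stream and render each token."),
        ("user", "Where are conversations persisted?"),
    ],
)
# MATCH runs a full-text query across all past messages.
rows = conn.execute(
    "SELECT content FROM messages WHERE messages MATCH ?", ("persisted",)
).fetchall()
```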

All chat settings are configurable via Settings (Ctrl+,):

  • Model: Auto-select or manual choice
  • Temperature: 0.0 (deterministic) to 2.0 (creative)
  • Max tokens: Response length limit
  • Device: CPU (default), CUDA, or Metal