Add an On-Device Model

CUST/OS can run LLMs, embedding models, speech-to-text, and vision detection entirely on-device with no network connectivity. This guide shows how to configure each kind.

How on-device models work

When custos.yaml declares a provider with a file:// URL, CUST/OS spawns a native inference server on the configured port. The system then calls http://127.0.0.1:<port> for inference. No external connectivity required.

Add an LLM

1. Get the model

GGUF models are the standard format. Download one from HuggingFace or your model provider:

adb push qwen2.5-7b-instruct-q4_k_m.gguf /sdcard/atak/custos/models/

2. Add a provider in custos.yaml

providers:
  - name: "qwen-on-device"
    task: "chat"
    protocol: "openai"
    runtime: "llama.cpp"
    url: "file:///sdcard/atak/custos/models/qwen2.5-7b-instruct-q4_k_m.gguf"
    port: 8411
    model: "qwen2.5-7b"
    contextSize: 8192
    threads: 4
    chatTemplatePath: "/sdcard/atak/custos/models/qwen3-tool-calling.jinja"
    requestTimeoutMs: 120000
    taskPriority: 100
    tier: "handheld"

3. Save and verify

Save custos.yaml. Within about 10 seconds the Status panel should show the provider as online.

Chat templates

The on-device LLM server uses jinja templates to render chat messages and tool-call instructions. Without a tool-calling template the model cannot emit tool calls in a parseable format. Templates ship in /sdcard/atak/custos/models/:

  • qwen3-tool-calling.jinja -- for Qwen models
  • gemma3-tool-calling.jinja -- for Gemma models

If you bring a different model family, drop in its tool-calling template and point chatTemplatePath at it.

Add an embedding model

Embeddings use the same configuration pattern with task: embedding:

  - name: "local-embed"
    task: "embedding"
    protocol: "openai"
    runtime: "llama.cpp"
    url: "file:///sdcard/atak/custos/models/nomic-embed-text-v1.5.Q8_0.gguf"
    port: 8414
    model: "nomic-embed-text-v1.5"
    contextSize: 512
    threads: 4
    taskPriority: 10

Embeddings power skill selection (semantic matching against the operator's message) and the vector store used by the custos.rag skill. Without an embedding provider, skill selection falls back to keyword-only matching -- still functional, just less precise.

Add a speech-to-text model (Whisper)

  - name: "on-device-whisper"
    task: "transcription"
    protocol: "openai"
    runtime: "whisper.cpp"
    url: "file:///sdcard/atak/custos/models/ggml-tiny.bin"
    port: 8412
    model: "whisper-tiny"
    taskPriority: 100
    tier: "handheld"

Whisper models are GGML format (.bin). The ggml-tiny.bin model (77 MB) is the right speed/quality tradeoff for handheld devices. After adding this provider, PTT recordings are transcribed on-device.

Add a vision detection model (ONNX)

  - name: "yolo-detection"
    task: "detection"
    protocol: "vision"
    runtime: "onnxruntime"
    url: "file:///sdcard/atak/custos/models/yolo11m.onnx"
    port: 8413
    model: "yolo11m"
    threads: 4
    confidence: 0.15
    inputSize: 640
    taskPriority: 1
    tier: "handheld"

Vision detection is used by skills like custos.detect_buildings. Custom labels can be specified in the properties: map.

Add an LLM (LiteRT-LM, Gemma-4-family)

LiteRT-LM is Google's on-device inference runtime. Unlike llama.cpp it runs in-process -- no HTTP server, no subprocess -- and keeps its KV cache warm across reasoning iterations, which makes follow-up turns noticeably faster.

Model format

LiteRT-LM uses .litertlm files. These are not interchangeable with GGUF and aren't on open mirrors -- grab them from Google's Gemma 4 release channels.

Setup (verified: Gemma 4 E2B-it)

adb push gemma-4-E2B-it.litertlm /sdcard/atak/custos/models/
providers:
  - name: "on-device-gemma4"
    task: "chat"
    protocol: "litert"
    url: "file:///sdcard/atak/custos/models/gemma-4-E2B-it.litertlm"
    model: "gemma-4-E2B-it"
    tier: "handheld"
    taskPriority: 1
    contextSize: 16384
    threads: 4
    properties:
      backend: "cpu"     # "cpu" (default), "gpu", or "npu"

Model size

  • Gemma 4 E2B-it — the recommended default for 16 GB-RAM handhelds. Fits contextSize: 16384 comfortably.
  • Gemma 4 E4B-it — the larger variant. Works, but drop contextSize to 4096 or 8192 on 16 GB devices to avoid OOM.

Backend selection

  • cpu (default) — FP32 inference. Safe on any hardware; start here if you haven't characterized your device.
  • gpu — OpenCL. On our reference hardware (Samsung S26 Ultra, Adreno 840) the custom 0.11.0-custos runtime is numerically stable, meaningfully faster than CPU, and the recommended backend once you have verified it on your device. Always re-verify before relying on GPU on a new device — multi-digit coordinate accuracy is the usual failure mode to watch for.
  • npu — Hexagon / dedicated NPU path. Not verified on any device yet; treat as experimental.

Model format summary

Format Runtime Extension Source
GGUF llama.cpp .gguf HuggingFace
LiteRT-LM LiteRT .litertlm HuggingFace litert-community/
GGML whisper.cpp .bin whisper.cpp releases
ONNX onnxruntime .onnx ONNX model zoo, Ultralytics

Tuning for handheld devices

Handheld Android devices are RAM-constrained. Recommendations:

  • Use Q4 or smaller quantizations -- Q4_K_M is the sweet spot for 4B-7B models.
  • Match threads to performance cores -- usually 4 on a phone, 6-8 on a tablet.
  • Lower contextSize -- every doubling reduces throughput. 4096 is sufficient for tactical conversations.
  • Pre-warm before mission -- the first inference call after startup is slow because the model is loading into RAM. Trigger it during pre-mission checks.
  • Set high taskPriority so the local model is the fallback, not the default. Cloud providers are usually faster when comms are up.

Vulkan GPU offload (llama.cpp)

If your device has a Vulkan-capable GPU (Snapdragon 8 Gen 2+ or newer), you can offload layers:

providers:
  - name: "qwen-vulkan"
    task: "chat"
    protocol: "openai"
    runtime: "llama.cpp"
    url: "file:///sdcard/atak/custos/models/qwen2.5-7b-instruct-q4_k_m.gguf"
    port: 8411
    model: "qwen2.5-7b"
    contextSize: 4096
    threads: 4
    properties:
      gpu-layers: "99"
    taskPriority: 100
    tier: "handheld"

CPU backend is recommended for reliability unless you have benchmarked your specific device and confirmed GPU offload provides a meaningful improvement.

Verify

In the Status panel, every file:// provider should show as online within 10-20 seconds of startup.