Add an On-Device Model

CUST/OS can run LLMs, embedding models, speech-to-text, and vision detection entirely on-device with no network connectivity. This guide shows how to configure each kind.

How on-device models work

When custos.yaml declares a provider with a file:// URL, CUST/OS spawns a native inference server on the configured port. The system then calls http://127.0.0.1:<port> for inference. No external connectivity required.

Add an LLM

1. Get the model

GGUF models are the standard format. Download one from HuggingFace or your model provider:

adb push qwen2.5-7b-instruct-q4_k_m.gguf /sdcard/atak/custos/models/

2. Add a provider in custos.yaml

providers:
  - name: "qwen-on-device"
    task: "chat"
    protocol: "openai"
    runtime: "llama.cpp"
    url: "file:///sdcard/atak/custos/models/qwen2.5-7b-instruct-q4_k_m.gguf"
    port: 8411
    model: "qwen2.5-7b"
    contextSize: 8192
    threads: 4
    chatTemplatePath: "/sdcard/atak/custos/models/qwen3-tool-calling.jinja"
    requestTimeoutMs: 120000
    taskPriority: 100
    tier: "handheld"

3. Save and verify

Save custos.yaml. Within about 10 seconds the Status panel should show the provider as online.

Chat templates

The on-device LLM server uses jinja templates to render chat messages and tool-call instructions. Without a tool-calling template the model cannot emit tool calls in a parseable format. Templates ship in /sdcard/atak/custos/models/:

qwen3-tool-calling.jinja -- for Qwen models
gemma3-tool-calling.jinja -- for Gemma models

If you bring a different model family, drop in its tool-calling template and point chatTemplatePath at it.

Add an embedding model

Embeddings use the same configuration pattern with task: embedding:

  - name: "local-embed"
    task: "embedding"
    protocol: "openai"
    runtime: "llama.cpp"
    url: "file:///sdcard/atak/custos/models/nomic-embed-text-v1.5.Q8_0.gguf"
    port: 8414
    model: "nomic-embed-text-v1.5"
    contextSize: 512
    threads: 4
    taskPriority: 10

Embeddings power skill selection (semantic matching against the operator's message) and the vector store used by the custos.rag skill. Without an embedding provider, skill selection falls back to keyword-only matching -- still functional, just less precise.

Add a speech-to-text model (Whisper)

  - name: "on-device-whisper"
    task: "transcription"
    protocol: "openai"
    runtime: "whisper.cpp"
    url: "file:///sdcard/atak/custos/models/ggml-tiny.bin"
    port: 8412
    model: "whisper-tiny"
    taskPriority: 100
    tier: "handheld"

Whisper models are GGML format (.bin). The ggml-tiny.bin model (77 MB) is the right speed/quality tradeoff for handheld devices. After adding this provider, PTT recordings are transcribed on-device.

Add a vision detection model (ONNX)

  - name: "yolo-detection"
    task: "detection"
    protocol: "vision"
    runtime: "onnxruntime"
    url: "file:///sdcard/atak/custos/models/yolo11m.onnx"
    port: 8413
    model: "yolo11m"
    threads: 4
    confidence: 0.15
    inputSize: 640
    taskPriority: 1
    tier: "handheld"

Vision detection is used by skills like custos.detect_buildings. Custom labels can be specified in the properties: map.

Add an LLM (LiteRT-LM, Gemma-4-family)

LiteRT-LM is Google's on-device inference runtime. Unlike llama.cpp it runs in-process -- no HTTP server, no subprocess -- and keeps its KV cache warm across reasoning iterations, which makes follow-up turns noticeably faster.

Model format

LiteRT-LM uses .litertlm files. These are not interchangeable with GGUF and aren't on open mirrors -- grab them from Google's Gemma 4 release channels.

Setup (verified: Gemma 4 E2B-it)

adb push gemma-4-E2B-it.litertlm /sdcard/atak/custos/models/

providers:
  - name: "on-device-gemma4"
    task: "chat"
    protocol: "litert"
    url: "file:///sdcard/atak/custos/models/gemma-4-E2B-it.litertlm"
    model: "gemma-4-E2B-it"
    tier: "handheld"
    taskPriority: 1
    contextSize: 16384
    threads: 4
    properties:
      backend: "cpu"     # "cpu" (default), "gpu", or "npu"

Model size

Gemma 4 E2B-it — the recommended default for 16 GB-RAM handhelds. Fits contextSize: 16384 comfortably.
Gemma 4 E4B-it — the larger variant. Works, but drop contextSize to 4096 or 8192 on 16 GB devices to avoid OOM.

Backend selection

cpu (default) — FP32 inference. Safe on any hardware; start here if you haven't characterized your device.
gpu — OpenCL. On our reference hardware (Samsung S26 Ultra, Adreno 840) the custom 0.11.0-custos runtime is numerically stable, meaningfully faster than CPU, and the recommended backend once you have verified it on your device. Always re-verify before relying on GPU on a new device — multi-digit coordinate accuracy is the usual failure mode to watch for.
npu — Hexagon / dedicated NPU path. Not verified on any device yet; treat as experimental.

Model format summary

Format	Runtime	Extension	Source
GGUF	llama.cpp	`.gguf`	HuggingFace
LiteRT-LM	LiteRT	`.litertlm`	HuggingFace `litert-community/`
GGML	whisper.cpp	`.bin`	whisper.cpp releases
ONNX	onnxruntime	`.onnx`	ONNX model zoo, Ultralytics

Tuning for handheld devices

Handheld Android devices are RAM-constrained. Recommendations:

Use Q4 or smaller quantizations -- Q4_K_M is the sweet spot for 4B-7B models.
Match threads to performance cores -- usually 4 on a phone, 6-8 on a tablet.
Lower contextSize -- every doubling reduces throughput. 4096 is sufficient for tactical conversations.
Pre-warm before mission -- the first inference call after startup is slow because the model is loading into RAM. Trigger it during pre-mission checks.
Set high taskPriority so the local model is the fallback, not the default. Cloud providers are usually faster when comms are up.

Vulkan GPU offload (llama.cpp)

If your device has a Vulkan-capable GPU (Snapdragon 8 Gen 2+ or newer), you can offload layers:

providers:
  - name: "qwen-vulkan"
    task: "chat"
    protocol: "openai"
    runtime: "llama.cpp"
    url: "file:///sdcard/atak/custos/models/qwen2.5-7b-instruct-q4_k_m.gguf"
    port: 8411
    model: "qwen2.5-7b"
    contextSize: 4096
    threads: 4
    properties:
      gpu-layers: "99"
    taskPriority: 100
    tier: "handheld"

CPU backend is recommended for reliability unless you have benchmarked your specific device and confirmed GPU offload provides a meaningful improvement.

Verify

In the Status panel, every file:// provider should show as online within 10-20 seconds of startup.