Add an On-Device Model
CUST/OS can run LLMs, embedding models, speech-to-text, and vision detection entirely on-device with no network connectivity. This guide shows how to configure each kind.
How on-device models work
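When custos.yaml declares a provider with a file:// URL and a port, CUST/OS spawns a native inference server on that port and then calls http://127.0.0.1:<port> for inference. No external connectivity is required. The LiteRT-LM runtime covered later in this guide is the exception: it runs in-process, with no HTTP server and no port.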
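The mapping from a provider entry to its loopback endpoint can be sketched as follows. This is an illustration only, not CUST/OS source code: the function names are hypothetical, and the rule that a provider without a port has no local endpoint is an assumption based on the in-process LiteRT-LM example later in this guide.

```python
from typing import Optional
from urllib.parse import urlparse


def model_path(provider: dict) -> Optional[str]:
    """Return the on-device file path behind a file:// URL, else None."""
    parsed = urlparse(provider["url"])
    return parsed.path if parsed.scheme == "file" else None


def local_endpoint(provider: dict) -> Optional[str]:
    """Return the loopback base URL a file:// provider would be served on.

    Assumption: only providers that declare both a file:// URL and a port
    are served over HTTP on 127.0.0.1.
    """
    if model_path(provider) is None:
        return None  # not a local model file
    port = provider.get("port")
    if port is None:
        return None  # assumed in-process runtime, no HTTP server
    return f"http://127.0.0.1:{port}"


if __name__ == "__main__":
    qwen = {
        "name": "qwen-on-device",
        "url": "file:///sdcard/atak/custos/models/qwen2.5-7b-instruct-q4_k_m.gguf",
        "port": 8411,
    }
    print(model_path(qwen))
    print(local_endpoint(qwen))
```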
Add an LLM
1. Get the model
GGUF is the standard model format for llama.cpp. Download a GGUF model from HuggingFace or your model provider, then push it to the device:
adb push qwen2.5-7b-instruct-q4_k_m.gguf /sdcard/atak/custos/models/
2. Add a provider in custos.yaml
providers:
  - name: "qwen-on-device"
    task: "chat"
    protocol: "openai"
    runtime: "llama.cpp"
    url: "file:///sdcard/atak/custos/models/qwen2.5-7b-instruct-q4_k_m.gguf"
    port: 8411
    model: "qwen2.5-7b"
    contextSize: 8192
    threads: 4
    chatTemplatePath: "/sdcard/atak/custos/models/qwen3-tool-calling.jinja"
    requestTimeoutMs: 120000
    taskPriority: 100
    tier: "handheld"
3. Save and verify
Save custos.yaml. Within about 10 seconds the Status panel should show the provider as online.
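If you want to confirm the provider from a script as well as from the Status panel, a probe along these lines may help. It is a sketch under two assumptions that this guide does not confirm: that a provider with protocol "openai" answers GET /v1/models, as OpenAI-compatible servers commonly do, and that port 8411 is reachable from wherever the script runs. The function name is hypothetical.

```python
import http.client
import urllib.error
import urllib.request


def provider_responds(base_url: str, path: str = "/v1/models",
                      timeout: float = 3.0) -> bool:
    """Return True if an HTTP GET on base_url + path answers with status 200.

    Assumption: the on-device server exposes an OpenAI-style model listing.
    Any connection error, timeout, or non-200 status counts as "not online".
    """
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


if __name__ == "__main__":
    # Port 8411 matches the provider entry in step 2.
    print(provider_responds("http://127.0.0.1:8411"))
```

A False result only means that the assumed endpoint did not answer; the Status panel remains the authoritative check.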
Chat templates
The on-device LLM server uses Jinja templates to render chat messages and tool-call instructions. Without a tool-calling template, the model cannot emit tool calls in a parseable format. Templates ship in /sdcard/atak/custos/models/:
- qwen3-tool-calling.jinja -- for Qwen models
- gemma3-tool-calling.jinja -- for Gemma models
If you bring a different model family, drop in its tool-calling template and point chatTemplatePath at it.
Add an embedding model
Embeddings use the same configuration pattern, with task set to "embedding":
- name: "local-embed"
  task: "embedding"
  protocol: "openai"
  runtime: "llama.cpp"
  url: "file:///sdcard/atak/custos/models/nomic-embed-text-v1.5.Q8_0.gguf"
  port: 8414
  model: "nomic-embed-text-v1.5"
  contextSize: 512
  threads: 4
  taskPriority: 10
Embeddings power skill selection (semantic matching against the operator's message) and the vector store used by the custos.rag skill. Without an embedding provider, skill selection falls back to keyword-only matching -- still functional, just less precise.
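To make "semantic matching" concrete, the sketch below ranks skills by cosine similarity between a message embedding and per-skill embeddings. It illustrates the general technique only: how CUST/OS actually scores skills is not documented here, the function names are hypothetical, and the three-dimensional vectors are toy values standing in for real embedding output.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_skills(message_vec: Sequence[float],
                skill_vecs: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Return (skill name, similarity) pairs, most similar first."""
    scored = [(name, cosine_similarity(message_vec, vec))
              for name, vec in skill_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy vectors; a real embedding model returns hundreds of dimensions.
    skills = {
        "custos.rag": [0.9, 0.1, 0.3],
        "custos.detect_buildings": [0.0, 1.0, 0.0],
    }
    for name, score in rank_skills([1.0, 0.0, 0.2], skills):
        print(f"{name}: {score:.3f}")
```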
Add a speech-to-text model (Whisper)
- name: "on-device-whisper"
  task: "transcription"
  protocol: "openai"
  runtime: "whisper.cpp"
  url: "file:///sdcard/atak/custos/models/ggml-tiny.bin"
  port: 8412
  model: "whisper-tiny"
  taskPriority: 100
  tier: "handheld"
Whisper models use the GGML format (.bin). The ggml-tiny.bin model (77 MB) is a good speed/quality tradeoff for handheld devices. After adding this provider, push-to-talk (PTT) recordings are transcribed on-device.
Add a vision detection model (ONNX)
- name: "yolo-detection"
  task: "detection"
  protocol: "vision"
  runtime: "onnxruntime"
  url: "file:///sdcard/atak/custos/models/yolo11m.onnx"
  port: 8413
  model: "yolo11m"
  threads: 4
  confidence: 0.15
  inputSize: 640
  taskPriority: 1
  tier: "handheld"
Vision detection is used by skills like custos.detect_buildings. Custom labels can be specified in the properties: map.
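The confidence setting is most naturally read as a minimum score below which detections are discarded. The sketch below shows that interpretation. It is an assumption about what the setting does, not documented CUST/OS behaviour, and the Detection type, its fields, and the example label are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    label: str                                # class name predicted by the model
    score: float                              # model confidence in [0, 1]
    box: Tuple[float, float, float, float]    # (x_min, y_min, x_max, y_max) in pixels


def filter_detections(detections: List[Detection],
                      confidence: float = 0.15) -> List[Detection]:
    """Keep detections whose score is at least the configured threshold."""
    return [d for d in detections if d.score >= confidence]


if __name__ == "__main__":
    raw = [
        Detection("building", 0.62, (10, 20, 200, 180)),
        Detection("building", 0.09, (300, 40, 340, 90)),
    ]
    print(filter_detections(raw))
```

Under this reading, a threshold as low as 0.15 favours recall: more candidate objects are kept, at the cost of more false positives.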
Add an LLM (LiteRT-LM, Gemma-4-family)
LiteRT-LM is Google's on-device inference runtime. Unlike llama.cpp it runs in-process -- no HTTP server, no subprocess -- and keeps its KV cache warm across reasoning iterations, which makes follow-up turns noticeably faster.
Model format
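LiteRT-LM uses .litertlm files. These are not interchangeable with GGUF, and general GGUF mirrors do not carry them -- get them from Google's Gemma 4 release channels (the format summary in this guide lists HuggingFace litert-community/ as the source).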
Setup (verified: Gemma 4 E2B-it)
adb push gemma-4-E2B-it.litertlm /sdcard/atak/custos/models/
providers:
  - name: "on-device-gemma4"
    task: "chat"
    protocol: "litert"
    url: "file:///sdcard/atak/custos/models/gemma-4-E2B-it.litertlm"
    model: "gemma-4-E2B-it"
    tier: "handheld"
    taskPriority: 1
    contextSize: 16384
    threads: 4
    properties:
      backend: "cpu"  # "cpu" (default), "gpu", or "npu"
Model size
- Gemma 4 E2B-it -- the recommended default for handhelds with 16 GB of RAM. Fits contextSize: 16384 comfortably.
- Gemma 4 E4B-it -- the larger variant. Works, but drop contextSize to 4096 or 8192 on 16 GB devices to avoid OOM.
Backend selection
- cpu (default) -- FP32 inference. Safe on any hardware; start here if you haven't characterized your device.
- gpu -- OpenCL. On our reference hardware (Samsung S26 Ultra, Adreno 840) the custom 0.11.0-custos runtime is numerically stable and meaningfully faster than CPU, and it is the recommended backend once you have verified it on your device. Always re-verify before relying on GPU on a new device; loss of accuracy on multi-digit coordinates is the usual failure mode to watch for.
- npu -- Hexagon / dedicated NPU path. Not verified on any device yet; treat as experimental.
Model format summary
| Format | Runtime | Extension | Source |
|---|---|---|---|
| GGUF | llama.cpp | .gguf | HuggingFace |
| LiteRT-LM | LiteRT | .litertlm | HuggingFace litert-community/ |
| GGML | whisper.cpp | .bin | whisper.cpp releases |
| ONNX | onnxruntime | .onnx | ONNX model zoo, Ultralytics |
Tuning for handheld devices
Handheld Android devices are RAM-constrained. Recommendations:
- Use Q4 or smaller quantizations -- Q4_K_M is the sweet spot for 4B-7B models.
- Match threads to performance cores -- usually 4 on a phone, 6-8 on a tablet.
- Lower contextSize -- every doubling reduces throughput. 4096 is sufficient for tactical conversations.
- Pre-warm before mission -- the first inference call after startup is slow because the model is loading into RAM. Trigger it during pre-mission checks.
- Set a high taskPriority value so the local model is the fallback, not the default. Cloud providers are usually faster when comms are up.
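The last recommendation implies that providers with a lower taskPriority value are tried first, so a high value makes the local model a fallback. The sketch below illustrates selection under that assumption. The actual CUST/OS selection logic is not described in this guide, and the pick_provider function and the "cloud-llm" provider are hypothetical.

```python
from typing import Dict, List, Optional, Set


def pick_provider(providers: List[Dict], task: str,
                  online: Set[str]) -> Optional[Dict]:
    """Pick the online provider for a task with the lowest taskPriority value.

    Assumption: lower taskPriority means "preferred"; offline providers are skipped.
    """
    candidates = [p for p in providers
                  if p["task"] == task and p["name"] in online]
    if not candidates:
        return None
    return min(candidates, key=lambda p: p["taskPriority"])


if __name__ == "__main__":
    providers = [
        {"name": "cloud-llm", "task": "chat", "taskPriority": 1},         # hypothetical
        {"name": "qwen-on-device", "task": "chat", "taskPriority": 100},
    ]
    # Comms up: the cloud provider is preferred.
    print(pick_provider(providers, "chat", {"cloud-llm", "qwen-on-device"})["name"])
    # Comms down: only the local model is online, so it takes over.
    print(pick_provider(providers, "chat", {"qwen-on-device"})["name"])
```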
Vulkan GPU offload (llama.cpp)
If your device has a Vulkan-capable GPU (Snapdragon 8 Gen 2 or newer), you can offload model layers to it:
providers:
  - name: "qwen-vulkan"
    task: "chat"
    protocol: "openai"
    runtime: "llama.cpp"
    url: "file:///sdcard/atak/custos/models/qwen2.5-7b-instruct-q4_k_m.gguf"
    port: 8411
    model: "qwen2.5-7b"
    contextSize: 4096
    threads: 4
    properties:
      gpu-layers: "99"
    taskPriority: 100
    tier: "handheld"
The CPU backend is recommended for reliability unless you have benchmarked your specific device and confirmed that GPU offload provides a meaningful improvement.
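One simple way to run that benchmark is to time the same prompt several times against each configuration and compare median throughput. The helper below is a generic sketch, not a CUST/OS tool: generate is a hypothetical callable that you supply, which sends the prompt to the provider under test and returns the number of tokens produced.

```python
import statistics
import time
from typing import Callable


def median_tokens_per_second(generate: Callable[[str], int],
                             prompt: str, runs: int = 3) -> float:
    """Median generation throughput over several timed runs.

    generate(prompt) is assumed to block until generation finishes and to
    return the number of tokens generated.
    """
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            rates.append(n_tokens / elapsed)
    if not rates:
        raise ValueError("no timed runs completed")
    return statistics.median(rates)


if __name__ == "__main__":
    def fake_generate(prompt: str) -> int:
        # Stand-in for a real request: pretend 10 tokens take 10 ms.
        time.sleep(0.01)
        return 10

    print(f"{median_tokens_per_second(fake_generate, 'status report'):.1f} tokens/s")
```

Run one warm-up call first so that model loading does not distort the first measurement, and use the same prompt and contextSize for both the CPU and the Vulkan configuration.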
Verify
In the Status panel, every file:// provider should show as online within 10-20 seconds of startup.