On-Device Inference

Running inference on the operator's device is the foundation of edge sovereignty. No cloud dependency, no network requirement, no data leaving the classification trust boundary.

Why on-device

Edge sovereignty. The operator's device is the only compute platform guaranteed to be available. Networks go down, vehicles drive away, command posts move. The device in the operator's hand is always there.
Offline operation. Tactical networks are intermittent by design. An agent that stops working when comms drop is not a tactical agent.
Classification trust boundary. Data that never leaves the device never crosses a classification boundary. No accreditation paperwork for inference traffic, no risk of spillage.

On-device inference is slower and less capable than cloud models. CUST/OS mitigates this through tiered fallback (use cloud when available, fall back to device when not) and by selecting models that are good enough for tactical tool-calling tasks.

Two runtimes

llama.cpp

The proven workhorse. Runs as a native HTTP subprocess on the device. The agent talks to it through the standard OpenAI-compatible API on localhost.

Model format: GGUF (quantized, single-file)
GPU: Vulkan (optional)
Maturity: Stable, well-tested, broad model support
Trade-off: Full prompt re-processing on every reasoning iteration

LiteRT-LM

Google's on-device inference SDK. Runs in-process via JNI -- no subprocess, no HTTP overhead.

Model format: .litertlm (converted from SafeTensors)
GPU: OpenCL (automatic detection)
Maturity: Newer, fewer models available
Advantage: Persists state across reasoning iterations -- only new tokens get processed on follow-up turns. This cuts follow-up latency roughly in half compared to llama.cpp.

Recommended models

See Tested models for the authoritative verification matrix. In brief:

Model	Format	Status	Notes
Gemma 4 E2B-it	.litertlm	Verified on Samsung S26 Ultra	The recommended default. Reliable tool calling, 16K context comfortable on 16 GB devices.
Gemma 4 E4B-it	.litertlm	Code-ready	Works, but drop `contextSize` to 4–8K on 16 GB devices to avoid OOM.
Gemma 3 4B IT	.gguf (Q4_K_M)	Code-ready	Bundled as a llama.cpp fallback by `download-models.sh`.
Other GGUFs (Qwen, Llama, etc.)	.gguf	Code-ready	Any GGUF supported by llama-server works; bring the matching tool-calling Jinja template.

Tool-calling reliability matters more than benchmark scores for CUST/OS. Every "verified" combo has run real skill and chat flows end-to-end, not just single-shot generation.

Current recommendation

Gemma 4 E2B-it on LiteRT-LM.

Backend pick depends on whether you've validated your hardware:

GPU backend — verified numerically stable on Samsung S26 Ultra (Adreno 840) with our custom 0.11.0-custos LiteRT-LM build, which applies an FP32 activation patch that fixes the multi-digit coordinate rounding the stock OSS build exhibited. Recommended once validated on your specific device.
CPU backend — slower but safe on any device. Start here if you haven't characterized your hardware.

Configure as a handheld tier provider:

providers:
  - name: "on-device-gemma4"
    task: chat
    protocol: litert
    model: "gemma-4-E2B-it"
    url: "file:///sdcard/atak/custos/models/gemma-4-E2B-it.litertlm"
    tier: handheld
    taskPriority: 1
    contextSize: 16384
    threads: 4
    properties:
      backend: "cpu"     # "cpu", "gpu", or "npu"

The npu backend (Hexagon / dedicated NPU) is also exposed but not verified on any device yet — treat as experimental.

Performance expectations

On modern flagship Android hardware, expect:

LiteRT-LM — first prefill takes longer (full initial processing); follow-up reasoning iterations are noticeably faster because the KV cache persists across ReAct iterations. This is the main reason LiteRT is the recommended handheld runtime today.
llama.cpp CPU — consistent but slower throughput. Memory bandwidth is the bottleneck on mobile devices. Every ReAct iteration re-processes the full prompt.
LiteRT GPU — meaningfully faster than CPU on the reference device once validated. Always verify multi-digit coordinate accuracy before relying on it on unfamiliar hardware.

Serialized sessions

LiteRT-LM permits only one active Conversation per Engine at a time. The adapter serializes every inference call through a mutex, which means background tasks (like the skill reranker or embedding rebuild) and foreground chat never run in parallel against the same LiteRT model — they interleave. You'll notice this as a brief pause in the chat stream when a background request fires. It's intentional: collisions produced FAILED_PRECONDITION: A session already exists errors that tripped the router's circuit breaker, so the adapter takes the lock instead.

If you need true parallelism (chat + background reranker running simultaneously), configure them against different providers — e.g. chat on the local Gemma 4 and embeddings on a LAN server.

What is next

Broader device verification of the GPU backend — the 0.11.0-custos FP32 activation patch should generalize across OpenCL GPUs, but each vendor's driver needs its own coordinate-precision check.
NPU backend verification on devices with Hexagon / dedicated NPU paths.
Newer SoCs with higher memory bandwidth will make larger models practical on-device.
Smaller specialist models tuned for tool calling (rather than general chat) could reduce model size without sacrificing tactical capability.