Reference: Tested models
What actually works. The adapter catalog in custos.yaml is broad; this page is narrow: only combinations that have been verified end-to-end, or deliberately ruled out.
CUST/OS is beta (as of 2026-04). This matrix is the single source of truth for what's been exercised on real hardware versus what's code-ready but unverified. Update it when you run a new combo to completion.
Status legend
| Status | Meaning |
|---|---|
| VERIFIED | Full user flow run end-to-end on the reference device. A skill or live chat consumed this provider + model combo and returned correct results. |
| KNOWN_BROKEN | Tried and failed in a specific, reproducible way. Workaround noted. |
| CODE_READY | Adapter ships in the APK, the protocol is wired, no verification run captured. Could work out of the box; report back when you try it. |
| NOT_TESTED | Pre-configured in example custos.yaml, no evidence it's been exercised. |
Reference device: Samsung Galaxy S26 Ultra (Adreno 840 GPU, 16GB RAM). Other devices may behave differently — especially on GPU-backend LiteRT-LM, where precision and memory headroom vary.
On-device LLMs
| Provider | Model | Backend | Status | Notes |
|---|---|---|---|---|
| litert | gemma-4-E2B-it | LiteRT GPU (FP32) | VERIFIED | Requires the custom litertlm-android-0.11.0-custos.aar, which enables the FP32 activation patch so coordinate precision survives GPU rounding. 16K context OK on 16GB devices. |
| litert | gemma-4-E4B-it | LiteRT GPU (FP32) | KNOWN_BROKEN | OOM at 16K context on the reference device. Drop context to 4–8K or use E2B. |
| litert | gemma-4-E4B-it | LiteRT CPU | CODE_READY | CPU avoids GPU precision concerns but is noticeably slower; tool calling not recently verified. |
| openai (file://) | gemma-3-4b-it-q4_k_m.gguf | llama.cpp (GGUF) | CODE_READY | Bundled in download-models.sh as an on-device fallback. Adapter supports tool-call + Hermes fallback; not recently re-verified on release APK. |
| openai (file://) | Other GGUF LLMs (Qwen, Llama, Mistral, etc.) | llama.cpp (GGUF) | CODE_READY | Any GGUF supported by llama-server works, provided you drop in the matching tool-calling Jinja template and point chatTemplatePath at it. |
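The "tool-call + Hermes fallback" noted for the GGUF rows can be pictured with a small sketch. It assumes the fallback refers to the Hermes-style convention in which the model wraps a JSON object in `<tool_call>` tags inside plain text; the function name `parse_hermes_tool_calls` and the `zoom` argument are hypothetical, and this is an illustration in Python rather than the adapter's actual code.

```python
import json
import re

# Hermes-style models emit tool calls as JSON wrapped in <tool_call> tags.
TOOL_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_hermes_tool_calls(text: str) -> list[dict]:
    """Extract well-formed tool calls from free text; skip malformed ones."""
    calls = []
    for raw in TOOL_CALL.findall(text):
        try:
            payload = json.loads(raw.strip())
        except json.JSONDecodeError:
            continue  # ignore blocks that are not valid JSON
        if isinstance(payload, dict) and "name" in payload:
            calls.append(
                {"name": payload["name"], "arguments": payload.get("arguments", {})}
            )
    return calls


# Hypothetical model output; the "zoom" argument is invented for illustration.
reply = (
    "Checking the tile now.\n"
    "<tool_call>\n"
    '{"name": "custos.detect_buildings", "arguments": {"zoom": 17}}\n'
    "</tool_call>"
)
print(parse_hermes_tool_calls(reply))
```

A parser like this only matters when the server does not return structured tool calls itself, which is why a matching tool-calling template is the preferred path.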
Known-broken tool-call mode: stock LiteRT-LM 0.10.0 with automaticToolCalling = true does not recover the numeric precision lost at GPU inference. The FP32 patch in the 0.11.0-custos AAR is the fix, and it's orthogonal to auto/manual tool calling. Our adapter keeps manual mode on so tool calls surface in the chat UI.
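To see why activation precision matters for coordinates, the sketch below round-trips one latitude through 16-bit and 32-bit IEEE 754 floats using only the Python standard library. It assumes the unpatched GPU path behaves like half precision, which this page does not state explicitly, and the coordinate is an arbitrary illustrative value.

```python
import struct


def round_trip(fmt: str, value: float) -> float:
    """Store value in the given IEEE 754 format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


latitude = 37.774929  # illustrative coordinate, not taken from the project

as_fp16 = round_trip("<e", latitude)  # half precision (assumed unpatched path)
as_fp32 = round_trip("<f", latitude)  # single precision (FP32 patch)

print(f"fp16: {as_fp16!r}  error {abs(as_fp16 - latitude):.6f} deg")
print(f"fp32: {as_fp32!r}  error {abs(as_fp32 - latitude):.8f} deg")
```

Half precision snaps this value to 37.78125, an error of about 0.006 degrees, which is hundreds of metres of latitude. Single precision keeps the error near a millionth of a degree. No tool-calling mode can restore digits that were already rounded away during inference, which is consistent with the fix living in the AAR rather than in the adapter.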
On-device embeddings / STT / Vision
| Provider | Model | Runtime | Status | Notes |
|---|---|---|---|---|
| openai (file://) | nomic-embed-text-v1.5.Q8_0.gguf | llama.cpp | NOT_TESTED | Wired in example custos.yaml; no live-run evidence captured. Skill selector falls back to BM25 when embeddings are cold, so this is non-blocking. |
| openai (file://) | all-MiniLM-L6-v2.Q8_0.gguf | llama.cpp | NOT_TESTED | Alternative lightweight embedder. |
| openai (file://) | ggml-tiny.bin (Whisper) | whisper.cpp | NOT_TESTED | Transcription for PTT. Wired but no user flow captured. |
| vision | yolo11m.onnx | ONNX Runtime | CODE_READY | Tile-capture end-to-end verified by custos.detect_buildings; YOLO output correctness not independently checked. |
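The BM25 fallback mentioned for the skill selector is a purely lexical ranking, so it needs no embedding model to be loaded. The following is a minimal sketch of standard Okapi BM25 scoring over skill descriptions; the function name, the sample descriptions, and the `k1`/`b` defaults are assumptions for illustration, not the selector's actual implementation.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / avgdl
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


# Hypothetical skill descriptions.
skills = [
    "detect buildings in a map tile",
    "transcribe push to talk audio",
    "summarize recent chat history",
]
scores = bm25_scores("detect buildings", skills)
print(skills[scores.index(max(scores))])
```

Because scoring depends only on token overlap, a query phrased with synonyms can miss the right skill, which is the gap embeddings are meant to close once they are warm.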
Cloud LLMs
| Provider | Model | Adapter | Status | Notes |
|---|---|---|---|---|
| openai | grok-4-1-fast-reasoning (xAI) | OpenAiCompatibleAdapter | CODE_READY | Configured in example custos.yaml; no user flow captured. |
| ollama | user-configurable | OllamaAdapter | CODE_READY | Tool-call support with JSON + Hermes fallback. Example config in custos.yaml is commented out. |
| vllm | user-configurable | VllmAdapter | CODE_READY | Dedicated adapter: handles vLLM's reasoning_content field and <think> tags so reasoning is stripped cleanly. Use protocol: vllm for vLLM servers; use protocol: openai for everything else OpenAI-compatible. |
| anthropic | user-configurable | AnthropicAdapter | CODE_READY | Messages API. Commented out in example custos.yaml. |
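The reasoning handling described for the vllm row can be sketched as follows. This is an illustrative Python version under the assumption that a response message is a dict with a `content` string and an optional `reasoning_content` string; the function name `split_reasoning` is hypothetical and the real VllmAdapter may differ.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(message: dict) -> tuple[str, str]:
    """Return (visible_text, reasoning) for one assistant message."""
    content = message.get("content") or ""
    parts = []
    # Reasoning delivered in its own field, separate from the visible content.
    if message.get("reasoning_content"):
        parts.append(message["reasoning_content"].strip())
    # Reasoning left inline as <think>...</think> blocks.
    parts.extend(block.strip() for block in THINK_BLOCK.findall(content))
    visible = THINK_BLOCK.sub("", content).strip()
    return visible, "\n".join(p for p in parts if p)


inline = {"content": "<think>check the units</think>The answer is 4."}
separate = {"content": "Done.", "reasoning_content": "plan the steps"}

print(split_reasoning(inline))
print(split_reasoning(separate))
```

Handling both shapes in one place keeps the chat UI free of reasoning text regardless of how the server is configured.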
Adding your own results
When you verify a new combo end-to-end:
- Note the exact device, model path, adapter, backend settings.
- Move the row from CODE_READY/NOT_TESTED to VERIFIED with a one-line note on what flow it ran.
- Open a PR against this file. Don't batch: small, frequent PRs keep the matrix trustworthy.
If something breaks, use KNOWN_BROKEN with the workaround. Dead-ending without a note makes future evaluators repeat the pain.
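A pre-PR check can catch rows that drift from the legend. The sketch below is a hypothetical helper, not a script that ships with the project. It assumes the model tables keep Status in the fourth column and Notes in the fifth, flags unknown status values, and flags VERIFIED or KNOWN_BROKEN rows with an empty note.

```python
VALID = {"VERIFIED", "KNOWN_BROKEN", "CODE_READY", "NOT_TESTED"}
NEEDS_NOTE = {"VERIFIED", "KNOWN_BROKEN"}


def check_matrix(markdown: str, status_col: int = 3) -> list[str]:
    """Return a list of human-readable problems found in pipe-table rows."""
    problems = []
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) <= status_col:
            continue  # narrower table, such as the status legend
        status = cells[status_col]
        if status == "Status" or (status and set(status) <= set("-: ")):
            continue  # header or separator row
        note = cells[status_col + 1] if len(cells) > status_col + 1 else ""
        if status not in VALID:
            problems.append(f"line {lineno}: unknown status {status!r}")
        elif status in NEEDS_NOTE and not note:
            problems.append(f"line {lineno}: {status} row needs a note")
    return problems


sample = """| Provider | Model | Backend | Status | Notes |
|---|---|---|---|---|
| litert | gemma-4-E2B-it | LiteRT GPU (FP32) | VERIFIED | 16K context OK |
| litert | gemma-4-E4B-it | LiteRT GPU (FP32) | KNOWN_BROKEN | |
| ollama | user-configurable | OllamaAdapter | WORKS | fine |"""

for problem in check_matrix(sample):
    print(problem)
```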
See also
- Provider protocols — field schema per protocol
- On-device inference — why LiteRT vs llama.cpp, GPU vs CPU
- Tiers and priority — how the router picks between providers