Reference: Tested models

What actually works. The adapter catalog in `custos.yaml` is broad; this page is narrow: it records which combinations have been verified end-to-end, which are known to fail, and which ship but remain unverified.

CUST/OS is beta (as of 2026-04). This matrix is the single source of truth for what's been exercised on real hardware versus what's code-ready but unverified. Update it when you run a new combo to completion.

Status legend

Status	Meaning
`VERIFIED`	Full user flow run end-to-end on the reference device. A skill or live chat consumed this provider + model combo and returned correct results.
`KNOWN_BROKEN`	Tried and failed in a specific, reproducible way. Workaround noted.
`CODE_READY`	Adapter ships in the APK and the protocol is wired, but no verification run has been captured. Could work out of the box; report back when you try it.
`NOT_TESTED`	Pre-configured in example `custos.yaml`, but there is no evidence it has been exercised.

Reference device: Samsung Galaxy S26 Ultra (Adreno 840 GPU, 16 GB RAM). Other devices may behave differently, especially on the LiteRT-LM GPU backend, where precision and memory headroom vary.

On-device LLMs

Provider	Model	Backend	Status	Notes
`litert`	`gemma-4-E2B-it`	LiteRT GPU (FP32)	`VERIFIED`	Requires the custom `litertlm-android-0.11.0-custos.aar`, which enables the FP32 activation patch so coordinate precision survives GPU rounding. 16K context works on 16 GB devices.
`litert`	`gemma-4-E4B-it`	LiteRT GPU (FP32)	`KNOWN_BROKEN`	OOM at 16K context on the reference device. Drop context to 4–8K or use `E2B`.
`litert`	`gemma-4-E4B-it`	LiteRT CPU	`CODE_READY`	CPU avoids GPU precision concerns but is noticeably slower; tool calling not recently verified.
`openai` (file://)	`gemma-3-4b-it-q4_k_m.gguf`	llama.cpp (GGUF)	`CODE_READY`	Bundled in `download-models.sh` as an on-device fallback. Adapter supports tool-call + Hermes fallback; not recently re-verified on release APK.
`openai` (file://)	Other GGUF LLMs (Qwen, Llama, Mistral, etc.)	llama.cpp (GGUF)	`CODE_READY`	Any GGUF supported by llama-server works, provided you drop in the matching tool-calling Jinja template and point `chatTemplatePath` at it.
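The rows above mention a Hermes fallback for tool calls. As a rough illustration of what such a fallback typically has to parse, the sketch below extracts Hermes-style `<tool_call>{...}</tool_call>` blocks from raw model text. It assumes the common Hermes convention of a JSON object with `name` and `arguments` keys; `parse_hermes_tool_calls` is a hypothetical helper, not the CUST/OS adapter's code.

```python
import json
import re

# Hermes-style models wrap each call in <tool_call>...</tool_call> with a JSON body.
_TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_hermes_tool_calls(text: str) -> list[dict]:
    """Return the well-formed tool calls found in raw model output (hypothetical helper)."""
    calls = []
    for body in _TOOL_CALL_RE.findall(text):
        try:
            payload = json.loads(body)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole turn
        if isinstance(payload, dict) and isinstance(payload.get("name"), str):
            calls.append(
                {"name": payload["name"], "arguments": payload.get("arguments", {})}
            )
    return calls
```

Skipping malformed blocks rather than raising is one possible design choice: a model that emits one bad call alongside a good one still gets the good one executed.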

Known-broken tool-call mode: stock LiteRT-LM 0.10.0 with `automaticToolCalling = true` does not recover the numeric precision lost during GPU inference. The fix is the FP32 patch in the 0.11.0-custos AAR, which is independent of whether tool calling is automatic or manual. Our adapter keeps manual mode on so that tool calls surface in the chat UI.
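To see why activation precision matters for coordinates, this sketch round-trips a latitude through IEEE half and single precision using only the standard library. It assumes the unpatched GPU path computes activations in FP16, which this page does not state explicitly; the sample coordinate is arbitrary and `round_trip` is a hypothetical helper.

```python
import struct


def round_trip(value: float, fmt: str) -> float:
    """Store value in the given struct format ('e' = FP16, 'f' = FP32) and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


lat = 37.774929  # arbitrary example latitude, in degrees
fp16_error = abs(round_trip(lat, "e") - lat)
fp32_error = abs(round_trip(lat, "f") - lat)

# One degree of latitude is roughly 111 km, so express the error in metres.
print(f"FP16 error: {fp16_error:.6f} deg (~{fp16_error * 111_000:.0f} m)")
print(f"FP32 error: {fp32_error:.8f} deg (~{fp32_error * 111_000:.2f} m)")
```

For values between 32 and 64, FP16 can only represent multiples of 1/32, so the half-precision error here is on the order of hundreds of metres, while the single-precision error stays well under a metre.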

On-device embeddings / STT / Vision

Provider	Model	Runtime	Status	Notes
`openai` (file://)	`nomic-embed-text-v1.5.Q8_0.gguf`	llama.cpp	`NOT_TESTED`	Wired in example `custos.yaml`; no live-run evidence captured. Skill selector falls back to BM25 when embeddings are cold, so this is non-blocking.
`openai` (file://)	`all-MiniLM-L6-v2.Q8_0.gguf`	llama.cpp	`NOT_TESTED`	Alternative lightweight embedder.
`openai` (file://)	`ggml-tiny.bin` (Whisper)	whisper.cpp	`NOT_TESTED`	Transcription for PTT. Wired but no user flow captured.
`vision`	`yolo11m.onnx`	ONNX Runtime	`CODE_READY`	Tile-capture end-to-end verified by `custos.detect_buildings`; YOLO output correctness not independently checked.
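The BM25 fallback mentioned for cold embeddings can be pictured with a small sketch: rank skill descriptions lexically unless an embedding ranker is available. Everything here is illustrative. The function names, the `embed_rank` callback, and whitespace tokenisation are assumptions, and the real skill selector may differ.

```python
import math
from collections import Counter


def bm25_rank(query: str, docs: dict[str, str], k1: float = 1.5, b: float = 0.75) -> list[str]:
    """Rank document ids by Okapi BM25 score against the query, best first."""
    if not docs:
        return []
    tokens = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(words) for words in tokens.values()) / n_docs
    scores = {}
    for doc_id, words in tokens.items():
        tf = Counter(words)
        score = 0.0
        for term in set(query.lower().split()):
            df = sum(1 for other in tokens.values() if term in other)
            if df == 0 or tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalisation: longer-than-average documents are penalised.
            norm = tf[term] + k1 * (1 - b + b * len(words) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


def select_skills(query: str, skills: dict[str, str], embed_rank=None) -> list[str]:
    """Use the embedding ranker when it is warm; otherwise fall back to BM25."""
    if embed_rank is not None:
        return embed_rank(query, skills)
    return bm25_rank(query, skills)
```

Passing `embed_rank=None` models the cold case, which is why an untested embedder does not block skill selection.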

Cloud LLMs

Provider	Model	Adapter	Status	Notes
`openai`	`grok-4-1-fast-reasoning` (xAI)	`OpenAiCompatibleAdapter`	`CODE_READY`	Configured in example `custos.yaml`; no user flow captured.
`ollama`	user-configurable	`OllamaAdapter`	`CODE_READY`	Tool-call support with JSON + Hermes fallback. Example config is commented out in `custos.yaml`.
`vllm`	user-configurable	`VllmAdapter`	`CODE_READY`	Dedicated adapter — handles vLLM's `reasoning_content` field and `<think>` tags so reasoning is stripped cleanly. Use `protocol: vllm` for vLLM servers; use `protocol: openai` for everything else OpenAI-compatible.
`anthropic`	user-configurable	`AnthropicAdapter`	`CODE_READY`	Messages API. Commented out in example `custos.yaml`.
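The `vllm` row describes stripping reasoning from responses. A minimal sketch of that idea, assuming an OpenAI-style assistant message dict that may carry a `reasoning_content` key and inline `<think>` blocks, could look like the following. `strip_reasoning` is a hypothetical name and this is not the `VllmAdapter` implementation.

```python
import re

# Matches an inline reasoning block plus any whitespace that follows it.
_THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_reasoning(message: dict) -> dict:
    """Return a copy of a chat message without reasoning text (hypothetical helper)."""
    # Drop the separate reasoning field without mutating the caller's dict.
    cleaned = {k: v for k, v in message.items() if k != "reasoning_content"}
    content = cleaned.get("content") or ""
    cleaned["content"] = _THINK_RE.sub("", content).strip()
    return cleaned
```

Handling both the separate field and the inline tags covers servers that run with a reasoning parser enabled as well as those that pass the raw model text through.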

Adding your own results

When you verify a new combo end-to-end:

1. Note the exact device, model path, adapter, and backend settings.
2. Move the row from `CODE_READY` / `NOT_TESTED` to `VERIFIED`, with a one-line note on which flow it ran.
3. Open a PR against this file. Don't batch: small, frequent PRs keep the matrix trustworthy.

If something breaks, mark the row `KNOWN_BROKEN` and note the workaround. Dead-ending without a note makes future evaluators repeat the pain.
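The update rules above could be checked mechanically before opening a PR. The sketch below is one way to encode them, assuming a row is reduced to provider, model, status, and notes; `Row` and `check_row` are hypothetical names and no such script ships with the project as far as this page says.

```python
from dataclasses import dataclass

# The four statuses from the legend on this page.
STATUSES = {"VERIFIED", "KNOWN_BROKEN", "CODE_READY", "NOT_TESTED"}


@dataclass
class Row:
    provider: str
    model: str
    status: str
    notes: str = ""


def check_row(row: Row) -> list[str]:
    """Return a list of problems with a matrix row; an empty list means it passes."""
    problems = []
    if row.status not in STATUSES:
        problems.append(f"unknown status: {row.status}")
    # VERIFIED needs the flow that ran; KNOWN_BROKEN needs the workaround.
    if row.status in {"VERIFIED", "KNOWN_BROKEN"} and not row.notes.strip():
        problems.append(f"{row.status} rows need a note (flow run or workaround)")
    return problems
```

For example, `check_row(Row("litert", "gemma-4-E4B-it", "KNOWN_BROKEN"))` reports the missing workaround note, while the same row with a note passes.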