Reference: Tested models

What actually works. The adapter catalog in custos.yaml is broad; this page is narrow — only combinations that have been verified end-to-end, or deliberately ruled out.

CUST/OS is beta (as of 2026-04). This matrix is the single source of truth for what's been exercised on real hardware versus what's code-ready but unverified. Update it when you run a new combo to completion.

Status legend

Status Meaning
VERIFIED Full user flow run end-to-end on the reference device. A skill or live chat consumed this provider + model combo and returned correct results.
KNOWN_BROKEN Tried and failed in a specific, reproducible way. Workaround noted.
CODE_READY Adapter ships in the APK, the protocol is wired, no verification run captured. Could work out of the box; report back when you try it.
NOT_TESTED Pre-configured in example custos.yaml, no evidence it's been exercised.

Reference device: Samsung Galaxy S26 Ultra (Adreno 840 GPU, 16GB RAM). Other devices may behave differently — especially on GPU-backend LiteRT-LM, where precision and memory headroom vary.

On-device LLMs

Provider Model Backend Status Notes
litert gemma-4-E2B-it LiteRT GPU (FP32) VERIFIED Requires the custom litertlm-android-0.11.0-custos.aar — enables the FP32 activation patch so coordinate-precision survives GPU rounding. 16K context OK on 16GB devices.
litert gemma-4-E4B-it LiteRT GPU (FP32) KNOWN_BROKEN OOM at 16K context on the reference device. Drop context to 4–8K or use E2B.
litert gemma-4-E4B-it LiteRT CPU CODE_READY CPU avoids GPU precision concerns but is noticeably slower; tool calling not recently verified.
openai (file://) gemma-3-4b-it-q4_k_m.gguf llama.cpp (GGUF) CODE_READY Bundled in download-models.sh as an on-device fallback. Adapter supports tool-call + Hermes fallback; not recently re-verified on release APK.
openai (file://) Other GGUF LLMs (Qwen, Llama, Mistral, etc.) llama.cpp (GGUF) CODE_READY Any GGUF supported by llama-server works, provided you drop in the matching tool-calling Jinja template and point chatTemplatePath at it.

Known-broken tool-call mode: stock LiteRT-LM 0.10.0 with automaticToolCalling = true does not recover the numeric precision lost at GPU inference. The FP32 patch in the 0.11.0-custos AAR is the fix, and it's orthogonal to auto/manual tool calling. Our adapter keeps manual mode on so tool calls surface in the chat UI.

On-device embeddings / STT / Vision

Provider Model Runtime Status Notes
openai (file://) nomic-embed-text-v1.5.Q8_0.gguf llama.cpp NOT_TESTED Wired in example custos.yaml; no live-run evidence captured. Skill selector falls back to BM25 when embeddings are cold, so this is non-blocking.
openai (file://) all-MiniLM-L6-v2.Q8_0.gguf llama.cpp NOT_TESTED Alternative lightweight embedder.
openai (file://) ggml-tiny.bin (Whisper) whisper.cpp NOT_TESTED Transcription for PTT. Wired but no user flow captured.
vision yolo11m.onnx ONNX Runtime CODE_READY Tile-capture end-to-end verified by custos.detect_buildings; YOLO output correctness not independently checked.

Cloud LLMs

Provider Model Adapter Status Notes
openai grok-4-1-fast-reasoning (xAI) OpenAiCompatibleAdapter CODE_READY Configured in example custos.yaml; no user flow captured.
ollama user-configurable OllamaAdapter CODE_READY Tool-call support with JSON + Hermes fallback. Example config in custos.yaml commented-out.
vllm user-configurable VllmAdapter CODE_READY Dedicated adapter — handles vLLM's reasoning_content field and <think> tags so reasoning is stripped cleanly. Use protocol: vllm for vLLM servers; use protocol: openai for everything else OpenAI-compatible.
anthropic user-configurable AnthropicAdapter CODE_READY Messages API. Commented-out in example custos.yaml.

Adding your own results

When you verify a new combo end-to-end:

  1. Note the exact device, model path, adapter, backend settings.
  2. Move the row from CODE_READY / NOT_TESTED to VERIFIED with a one-line note on what flow it ran.
  3. Open a PR against this file. Don't batch — small, frequent PRs keep the matrix trustworthy.

If something breaks, use KNOWN_BROKEN with the workaround. Dead-ending without a note makes future evaluators repeat the pain.

See also