Inference Routing

Every CUST/OS deployment can have multiple LLMs, embedding models, speech-to-text engines, text-to-speech engines, and vision models -- local and remote, free and paid, fast and slow. The inference router is the layer that turns "the agent needs to call a model" into "the right provider, with the right fallback".

The problem

Tactical AI is multi-tier by nature:

Handheld -- a small, fast model running on the operator's own device. Always available. Limited reasoning.
Pack -- a model on wired peripheral compute the operator carries on their body. No RF emissions. Available whenever the peripheral has power.
Mobile -- a model on a companion compute device the squad carries. Available within proximity.
Mounted -- a larger model on a vehicle compute platform. Available when in range.
Command post -- a workstation reachable over the tactical mesh. High capability, mesh-dependent.
Cloud -- a heavyweight cloud model. Best reasoning, only available with connectivity.

The agent should use the best available provider, fall back gracefully when one goes down, and never break the operator's experience because a single provider is unhappy.

Per-task routing

The router maintains a separate provider list for each task type: chat, embedding, transcription, text-to-speech, and vision. Chat calls go to the chat list. Transcription calls go to the transcription list. There is no cross-contamination.

Within each task, providers are sorted by priority (lower number wins). On every inference call:

Walk the sorted provider list.
Skip providers that are unhealthy or blocked by the current security mode.
Try the call.
On success, return the result.
On failure, move on to the next provider.
If every provider fails, report the error.

This is entirely data-driven. There is no hardcoded logic about which provider to prefer -- the operator sets priorities in the configuration and CUST/OS does the rest.
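The walk described above can be sketched in a few lines. This is a minimal illustration, not the CUST/OS implementation: the `Provider` fields, the `route` function, the `AllProvidersFailed` exception, and the provider names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Provider:
    name: str
    priority: int                  # lower number wins
    call: Callable[[str], str]     # performs the inference request
    healthy: bool = True           # assumed to be set by health monitoring
    blocked: bool = False          # assumed to be set by the security mode


class AllProvidersFailed(Exception):
    """Raised when no provider in the task's list returned a result."""


def route(providers_by_task: Dict[str, List[Provider]],
          task: str, request: str) -> Tuple[str, str]:
    """Return (provider name, result) from the first provider that succeeds."""
    errors = []
    for provider in sorted(providers_by_task.get(task, []),
                           key=lambda p: p.priority):
        if not provider.healthy or provider.blocked:
            continue                       # skipped without trying the call
        try:
            return provider.name, provider.call(request)
        except Exception as exc:           # failure: try the next provider
            errors.append(f"{provider.name}: {exc}")
    raise AllProvidersFailed(f"task {task!r}: {errors or 'no eligible provider'}")


def _refuse(_request: str) -> str:
    raise ConnectionError("connection refused")


# Hypothetical configuration: one independent list per task type.
providers = {
    "chat": [
        Provider("cloud-llm", priority=1, call=_refuse),
        Provider("handheld-llm", priority=2, call=lambda r: f"echo: {r}"),
    ],
    "embedding": [],
}
```

With this configuration, `route(providers, "chat", "status?")` tries `cloud-llm` first, catches the refused connection, and returns the result from `handheld-llm`. A call for the `embedding` task raises `AllProvidersFailed` because its list is empty, which shows that the chat list is never consulted for another task.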

Health monitoring

Providers are checked periodically (every few seconds by default). Each check verifies that the provider's endpoint is reachable and responsive. The result is one of:

Online -- healthy, accepting requests.
Slow -- responding, but with degraded latency.
Offline -- unreachable or erroring.

The operator can see provider health at a glance in the status panel. Health states update in real time.
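One way to map a single check onto the three states is to time a probe call. The sketch below assumes a hypothetical latency threshold (`SLOW_THRESHOLD_S`) and a hypothetical `probe` callable that raises when the endpoint is unreachable; the source does not say how CUST/OS decides that a provider is slow.

```python
import time
from enum import Enum
from typing import Callable


class Health(Enum):
    ONLINE = "online"
    SLOW = "slow"
    OFFLINE = "offline"


SLOW_THRESHOLD_S = 1.0  # hypothetical value, not taken from CUST/OS


def check(probe: Callable[[], None],
          slow_threshold_s: float = SLOW_THRESHOLD_S,
          clock: Callable[[], float] = time.monotonic) -> Health:
    """Run one health check and classify the outcome."""
    start = clock()
    try:
        probe()                      # e.g. a lightweight request to the endpoint
    except Exception:
        return Health.OFFLINE        # unreachable or erroring
    elapsed = clock() - start
    return Health.SLOW if elapsed > slow_threshold_s else Health.ONLINE
```

The `clock` parameter exists only so the classification can be exercised without real delays; a periodic scheduler would call `check` with the default clock.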

Automatic failover

When a provider fails repeatedly, the router temporarily removes it from consideration and falls back to the next provider in the priority list. After a cooldown period, the failed provider is retried. If it succeeds, it resumes service. If it fails again, it is set aside for another cooldown.

This means multi-provider deployments handle outages silently. The operator's chat experience is uninterrupted -- only the status indicator changes to show which provider is active.

The router only triggers failover on infrastructure failures (timeouts, connection refused, server errors). Content failures (the model says "I don't know") are valid responses that do not affect provider health.
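The cooldown behavior resembles a simple circuit breaker. The sketch below is an assumption-laden illustration: the `Breaker` class, the `InfrastructureError` type, and the threshold and cooldown values are hypothetical, since the source only says "repeatedly" and "a cooldown period".

```python
from dataclasses import dataclass
from typing import Callable, Optional


class InfrastructureError(Exception):
    """Stand-in for timeouts, refused connections, and server errors."""


@dataclass
class Breaker:
    failure_threshold: int = 3     # hypothetical
    cooldown_s: float = 30.0       # hypothetical
    failures: int = 0
    open_until: float = 0.0        # provider is set aside until this time

    def available(self, now: float) -> bool:
        return now >= self.open_until

    def record_success(self) -> None:
        self.failures = 0
        self.open_until = 0.0

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.open_until = now + self.cooldown_s


def attempt(breaker: Breaker, call: Callable[[str], str],
            request: str, now: float) -> Optional[str]:
    """Return the result, or None if the router should try the next provider."""
    if not breaker.available(now):
        return None
    try:
        result = call(request)
    except InfrastructureError:
        breaker.record_failure(now)    # only infrastructure failures count
        return None
    breaker.record_success()           # any answer, even "I don't know", is healthy
    return result
```

Because the failure count is only reset on success, a provider that fails its first retry after the cooldown is immediately set aside for another cooldown, which matches the behavior described above.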

Tier-based filtering

Providers are grouped by tier based on their physical and trust environment:

Tier	What lives here
`handheld`	On-device models. No network involved.
`pack`	Wired peripheral compute on the operator's body. No RF emissions.
`mobile`	Squad-carried companion compute.
`mounted`	Vehicle compute platform.
`command-post`	Mesh-reachable workstations.
`cloud`	Internet-hosted providers.

Security modes filter which tiers are allowed:

Normal -- all tiers active.
No-cloud -- cloud providers blocked.
Field Only -- handheld, pack, mobile, and mounted. No command post or cloud.
Squad Only -- handheld, pack, and mobile only.
EMCON -- handheld and pack only. No wireless traffic of any kind.
Standalone -- handheld only. No external compute trusted -- not even wired peripherals.

This filter applies uniformly to all inference paths. There is no bypass. The operator can switch security modes with a single tap.

For the full reasoning behind tiers, see Tiers and priority.
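The mode list above amounts to a lookup table from security mode to allowed tiers. A minimal sketch follows; the tier identifiers come from the table, while the mode keys, function names, and provider names are hypothetical.

```python
ALL_TIERS = ("handheld", "pack", "mobile", "mounted", "command-post", "cloud")

# Mode keys are hypothetical spellings of the modes listed above.
ALLOWED_TIERS = {
    "normal": frozenset(ALL_TIERS),
    "no-cloud": frozenset(ALL_TIERS) - {"cloud"},
    "field-only": frozenset({"handheld", "pack", "mobile", "mounted"}),
    "squad-only": frozenset({"handheld", "pack", "mobile"}),
    "emcon": frozenset({"handheld", "pack"}),
    "standalone": frozenset({"handheld"}),
}


def tier_allowed(mode: str, tier: str) -> bool:
    """True if providers in `tier` may be used under `mode`."""
    return tier in ALLOWED_TIERS[mode]


def filter_providers(mode, providers):
    """Keep the names of (name, tier) pairs whose tier the mode allows."""
    return [name for name, tier in providers if tier_allowed(mode, tier)]


# Hypothetical fleet used only for illustration.
fleet = [
    ("phone-llm", "handheld"),
    ("vest-llm", "pack"),
    ("truck-llm", "mounted"),
    ("hosted-llm", "cloud"),
]
```

Each mode in this table allows a subset of the tiers allowed by the mode listed before it, so switching to a stricter mode can only remove providers, never add them.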

Classification boundary

Before the router selects a provider, it checks whether the provider is cleared for the operator's data. Each provider declares a classification level, and the router never sends data to a provider whose declared level is below the configured classification ceiling.

This prevents tactical data from accidentally flowing to a provider that isn't cleared to handle it.
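The check reduces to an ordered comparison between the provider's declared level and the configured ceiling. In the sketch below the level names, their ordering, and the provider names are hypothetical placeholders; the source does not list the actual levels.

```python
from enum import IntEnum
from typing import Dict, List


class Level(IntEnum):
    # Hypothetical level names; a higher value means cleared for more.
    OPEN = 0
    RESTRICTED = 1
    CONTROLLED = 2


def cleared(provider_level: Level, ceiling: Level) -> bool:
    """A provider is eligible only if its level is not below the ceiling."""
    return provider_level >= ceiling


def cleared_providers(declared: Dict[str, Level], ceiling: Level) -> List[str]:
    return [name for name, level in declared.items() if cleared(level, ceiling)]


# Hypothetical declarations used only for illustration.
declared = {
    "hosted-llm": Level.OPEN,
    "post-llm": Level.RESTRICTED,
    "phone-llm": Level.CONTROLLED,
}
```

With a ceiling of `Level.RESTRICTED`, `hosted-llm` is excluded before routing begins, so it is never tried even if it is healthy and has the best priority.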

Streaming

For chat, the router streams tokens as the model generates them. The operator sees the response build in real time instead of waiting for the complete reply. If a provider doesn't support streaming, the response is delivered as a single block -- the agent handles both modes transparently.
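One way to let the caller handle both modes uniformly is to expose every response as an iterator of text chunks. The sketch below assumes a hypothetical `ChatProvider` shape with an optional `stream` callable; it is not the CUST/OS interface.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Optional


@dataclass
class ChatProvider:
    complete: Callable[[str], str]                       # whole response at once
    stream: Optional[Callable[[str], Iterator[str]]] = None  # tokens, if supported


def stream_chat(provider: ChatProvider, prompt: str) -> Iterator[str]:
    """Yield tokens when streaming is supported, else one single block."""
    if provider.stream is not None:
        yield from provider.stream(prompt)
    else:
        yield provider.complete(prompt)


# Hypothetical providers used only for illustration.
streaming = ChatProvider(complete=lambda p: "Hello",
                         stream=lambda p: iter(["Hel", "lo"]))
blocking = ChatProvider(complete=lambda p: "Hello")
```

A consumer that appends each chunk to the display works unchanged with either provider: `streaming` produces two chunks, and `blocking` produces the whole reply as one chunk.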

On-device models

Providers with models stored on the device start as background processes after plugin initialization. They appear online 5-30 seconds after startup, once the model is loaded into memory. Cloud providers are ready as soon as the network is up.

On-device models participate in the same routing, health monitoring, and failover infrastructure as remote providers. From the router's perspective, they are just another provider with a different tier.

LiteRT-LM providers in particular serialize every request through a per-engine mutex -- only one conversation runs at a time against a given LiteRT model. That means chat and background reranking interleave rather than run in parallel. See on-device inference for detail and mitigations.
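The effect of a per-engine mutex can be illustrated with a plain lock around a stand-in generate function. This sketch does not use the LiteRT-LM API; `SerializedEngine` and the lambda standing in for the model are hypothetical.

```python
import threading
from typing import Callable


class SerializedEngine:
    """Wraps one model so that only one request runs against it at a time."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate
        self._lock = threading.Lock()    # one lock per engine instance

    def run(self, prompt: str) -> str:
        with self._lock:                 # other callers wait here
            return self._generate(prompt)


# Chat and a background reranker sharing one engine take turns.
engine = SerializedEngine(lambda prompt: prompt.upper())
chat_reply = engine.run("where is the rally point")
rerank_reply = engine.run("score these passages")
```

Because the lock belongs to the engine instance, two different on-device models would not block each other; only requests against the same model queue up.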

What this gets you

The router exists to make multi-provider deployments boring:

Tiered fallback -- never fail just because one provider is down.
Health-driven selection -- sick providers are skipped automatically.
Streaming-first -- the operator sees progress immediately.
Per-task routing -- each task type has its own independent provider list.
Security-mode filtering -- entire tiers can be blocked with a single setting.
Classification enforcement -- data stays within its classification boundary.