The best model per task, within your caps.
Constrained optimization. Filter to models that clear your quality floor under budget + latency caps. Pick the cheapest. Every decision on chain.
Constrained optimization, not a scoring shortcut.
Most routers scalarize: w1·quality − w2·cost − w3·latency. That hides normalization in the weights, mixes incommensurable units, and can't reach non-convex regions of the cost–quality frontier. We solve the constrained form instead.
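A small illustration of the non-convex point, using three made-up models (the costs and quality scores are illustrative assumptions, not measurements). Model B sits on the Pareto frontier but below the line joining A and C, so no choice of weights ever selects it. A budget cap plus a quality floor does.

```python
# Hypothetical (cost, quality) points. All three are Pareto-optimal:
# each step up in cost buys a step up in quality.
MODELS = {"A": (1.0, 0.70), "B": (5.0, 0.75), "C": (10.0, 0.95)}


def scalarized_pick(w_quality, w_cost):
    """Weighted-sum router: maximize w_quality*quality - w_cost*cost."""
    return max(
        MODELS,
        key=lambda m: w_quality * MODELS[m][1] - w_cost * MODELS[m][0],
    )


def constrained_pick(budget, floor):
    """Cheapest model with cost <= budget and quality >= floor, else None."""
    feasible = [m for m, (c, q) in MODELS.items() if c <= budget and q >= floor]
    return min(feasible, key=lambda m: MODELS[m][0]) if feasible else None


# Sweep the cost weight from 1e-6 to 1e2 with the quality weight fixed at 1.
winners = {scalarized_pick(1.0, 10 ** (k / 10)) for k in range(-60, 21)}
```

Across that sweep the weighted sum only ever returns A or C, while `constrained_pick(6.0, 0.72)` returns B.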
r(x, a) = argmin cost(m)
            m ∈ M_allowed(x, a), q_prior(m, task) ≥ floor(x, a)
  s.t.    Σ cost(workflow) + ĉ(m) ≤ budget_envelope(a)
          l̂(m, x) ≤ latency_target(x)
  ties → higher q_prior → lower latency tier

// v0 ships q_prior only, anchored to current public benchmarks for the
// 5 frontier backends. Q(m,x,a) ≈ q_prior at v0; the remaining terms
// activate at v1+ once §16 history lets q_empirical train (below).
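The v0 rule above can be sketched in a few lines. This is a minimal sketch, not the production router: `Candidate`, `route_v0`, and the argument names are hypothetical, and raw projected latency stands in for the "latency tier" in the tie-break.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    q_prior: float      # benchmark-anchored quality prior
    cost: float         # projected cost of this call, in USD
    latency_ms: float   # projected latency for this request


def route_v0(candidates, floor, budget_envelope, spent_so_far, latency_target_ms):
    """Return the cheapest candidate that clears the floor and both caps, else None."""
    eligible = [
        c for c in candidates
        if c.q_prior >= floor
        and spent_so_far + c.cost <= budget_envelope
        and c.latency_ms <= latency_target_ms
    ]
    if not eligible:
        return None
    # argmin cost; ties go to higher q_prior, then to lower latency.
    return min(eligible, key=lambda c: (c.cost, -c.q_prior, c.latency_ms))


# Illustrative data only.
pool = [
    Candidate("m-a", q_prior=0.93, cost=0.004, latency_ms=900),
    Candidate("m-b", q_prior=0.95, cost=0.004, latency_ms=1100),
    Candidate("m-c", q_prior=0.97, cost=0.009, latency_ms=700),
    Candidate("m-d", q_prior=0.80, cost=0.001, latency_ms=300),
]
choice = route_v0(pool, floor=0.90, budget_envelope=0.02,
                  spent_so_far=0.012, latency_target_ms=1500)
```

Here `m-d` misses the floor, `m-c` would overrun the workflow budget, and the cost tie between `m-a` and `m-b` resolves to `m-b` on the higher prior.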
r(x, a) = argmax Q(m, x, a)
            m ∈ M_allowed(x, a)
  s.t.    Σ cost(workflow) + ĉ(m) ≤ budget_envelope(a)
          l̂(m, x) ≤ latency_target(x)

Q(m, x, a) = q_prior(m, task)             // public seed (commodity)
           + q_empirical(m, x, a)         // moat: learned residual over q_prior
           + β·σ(m, x)                    // uncertainty / exploration (LinUCB; β=0 at v0)
           + γ·consistency(m, workflow)   // workflow coherence
           − w_h·h(m, t)                  // live provider health
           − w_r·ρ(m, x)                  // residual risk
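A minimal sketch of how the Q terms compose, to make the v0 behavior concrete. The names `QWeights` and `q_score` are hypothetical, the weights are placeholders, and `health` is assumed to be a penalty (higher means a less healthy provider) because the formula subtracts it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QWeights:
    beta: float = 0.0    # exploration weight (0 at v0, per the text)
    gamma: float = 0.0   # workflow-consistency weight
    w_h: float = 0.0     # provider-health weight
    w_r: float = 0.0     # residual-risk weight


def q_score(q_prior, q_empirical=0.0, sigma=0.0, consistency=0.0,
            health=0.0, risk=0.0, w=QWeights()):
    """Sum the Q terms with the same signs as the formula above."""
    return (
        q_prior
        + q_empirical
        + w.beta * sigma
        + w.gamma * consistency
        - w.w_h * health
        - w.w_r * risk
    )
```

With every weight at zero and no learned residual, `q_score(0.917)` is just the prior, which is the v0 case. Setting `beta` above zero adds the exploration bonus once an uncertainty estimate exists.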
Payment, identity, and transport for agents are converging into open commodities. q_empirical is the only term that can't be reproduced by copying a spec — it's a learned residual over q_prior, trained on Ainfera's own routed-outcome records. Zero at launch; compounds with every settled §16 record below.
Hot path under 30 ms.
Control plane (policy compilation, learning, evaluation) is off the hot path. The data plane targets sub-30 ms per decision.
Every routed call writes the same record.
The audit chain is append-only and hash-chained — schema cannot migrate after capture. The §16 record is what makes q_empirical trainable.
{
request_id, agent_id,
candidates[], chosen_model,
M_allowed_set, // the post-veto candidate set
q_prior_used, // floor-clearing quality, per chosen model
cost_projected, cost_actual_usd,
latency_ms,
outcome_status, // ok | fallback | error
seed, // deterministic-replay seed
policy_version, // {name}@{semver}
ruleset_hash, // hash of weights + candidate set — catches silent drift
traffic_origin, // fleet | external | test — dogfood-contamination guard
fleet_agent // nullable — per-agent dogfood analysis
}
The chain is append-only and hash-chained, so the schema is one-shot. Reserved for v1+ (empty at v0; wired on when the classifier and embedding store come online with LinUCB): task_type, task_type_source, query_embedding_ref, query_embedding_hash, embedding_model, cell. The query embedding itself lives in a durable feature store; only its hash + ref will enter the chain, so the privacy boundary holds and the training signal is preserved.
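For readers new to hash chaining, here is a minimal sketch of the idea. SHA-256, the canonical JSON encoding, and the all-zero genesis value are assumptions chosen for illustration; the page does not say which hash or encoding the chain uses.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for the first link


def record_hash(record, prev_hash):
    """SHA-256 over the previous hash plus a canonical JSON encoding of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(chain, record):
    """Append a record, linking it to the hash of the previous entry."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"record": record, "prev_hash": prev,
                  "hash": record_hash(record, prev)})


def verify(chain):
    """Recompute every link; any edited record breaks the chain from that point on."""
    prev = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev:
            return False
        if entry["hash"] != record_hash(entry["record"], prev):
            return False
        prev = entry["hash"]
    return True
```

Because each hash covers the previous one, changing a field in an old record, or changing its shape, invalidates every later link. That is why the schema has to be right at capture time.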
One request, four candidates, one decision.
A real call your research agent might make. All four candidates fit the budget and latency caps, but the reliability floor excludes the cheapest one. We pick the cheapest of what's left.
# 1 SDK call. Caps in the body. We score candidates and pick.
# Assumes `client` is an already-configured OpenAI-style SDK client
# and `query` holds the prompt string.
res = client.chat.completions.create(
    model="ainfera-mithril",
    messages=[{"role": "user", "content": query}],
    extra_body={
        "caps": {
            "budget": 0.012,
            "latency_ms": 1500,
            "quality": 0.90,
            "reliability": 0.9985,
        },
    },
)
# res.routing.model      → "gemini-3-1-pro"
# res.routing.candidates → 4 models, 1 excluded
# res.audit.id           → inf_...
| Candidate | Quality | Cost | p50 latency | Reliability | Status |
|---|---|---|---|---|---|
| claude-opus-4-7 | 0.942 | $0.0061 | 940 ms | 99.94% | eligible · ranked 3 |
| gpt-5-5 | 0.931 | $0.0058 | 870 ms | 99.91% | eligible · ranked 2 |
| gemini-3-1-pro | 0.917 | $0.0049 | 760 ms | 99.88% | ● chosen |
| grok-4 | 0.902 | $0.0042 | 1,240 ms | 99.71% | excluded · reliability |
You set the box. We pick the model inside it.
Four caps, settable per agent or per task type. If no model fits, we tell you — we don't quietly downgrade.
Per-call cost ceiling.
A hard cap in dollars per call, or in dollars per million tokens. Cheaper models stay eligible — expensive ones drop.
p50 wall-clock ceiling.
Measured against rolling production traffic, not vendor-published numbers. Slow models drop, even when they're cheap.
Minimum measured quality.
A floor on real-world quality scores, measured per task type. We won't route below this no matter how cheap or fast.
Minimum 30-day success rate.
If a provider has been flaky this month, it's out. It comes back automatically once it has stayed healthy for 24 hours.
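The four caps can be checked against the numbers from the example table. This is a sketch under assumptions: `violated_cap`, `pick`, and `NoEligibleModel` are hypothetical names, and the order in which caps are tested is arbitrary.

```python
CAPS = {"budget": 0.012, "latency_ms": 1500, "quality": 0.90, "reliability": 0.9985}

# (quality, cost_usd, p50_latency_ms, reliability), copied from the example table.
CANDIDATES = {
    "claude-opus-4-7": (0.942, 0.0061, 940, 0.9994),
    "gpt-5-5": (0.931, 0.0058, 870, 0.9991),
    "gemini-3-1-pro": (0.917, 0.0049, 760, 0.9988),
    "grok-4": (0.902, 0.0042, 1240, 0.9971),
}


class NoEligibleModel(Exception):
    """Raised instead of silently downgrading when no candidate fits the caps."""


def violated_cap(row, caps):
    """Return the name of the first cap this candidate breaks, or None."""
    quality, cost, latency, reliability = row
    if cost > caps["budget"]:
        return "budget"
    if latency > caps["latency_ms"]:
        return "latency"
    if quality < caps["quality"]:
        return "quality"
    if reliability < caps["reliability"]:
        return "reliability"
    return None


def pick(candidates, caps):
    """Return (eligible models ranked cheapest first, {excluded model: reason})."""
    eligible, excluded = [], {}
    for name, row in candidates.items():
        reason = violated_cap(row, caps)
        if reason is None:
            eligible.append(name)
        else:
            excluded[name] = reason
    if not eligible:
        raise NoEligibleModel(excluded)
    return sorted(eligible, key=lambda n: candidates[n][1]), excluded
```

`pick(CANDIDATES, CAPS)` ranks gemini-3-1-pro, gpt-5-5, claude-opus-4-7 in that order and reports grok-4 as excluded on reliability, matching the table. Tightening the budget to $0.001 raises `NoEligibleModel` rather than returning a weaker model.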
Fallback is the second-best candidate, not panic.
If the chosen model returns a 429, 5xx, timeout, or refusal, we retry on the next-ranked candidate within your caps. Logged and audited, same as any other decision.
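A minimal sketch of that fallback loop. `ProviderError`, `call_with_fallback`, and the log entry shape are hypothetical; `call_model` stands for whatever function actually invokes a provider, and `ranked` is the in-caps candidate list, best first.

```python
class ProviderError(Exception):
    """Hypothetical error raised by a provider call."""

    def __init__(self, kind, status=None):
        super().__init__(kind)
        self.kind = kind        # "http", "timeout", or "refusal"
        self.status = status    # HTTP status code when kind == "http"


def is_retryable(err):
    """429, any 5xx, a timeout, or a refusal moves us to the next candidate."""
    if err.kind in ("timeout", "refusal"):
        return True
    status = err.status or 0
    return err.kind == "http" and (status == 429 or 500 <= status <= 599)


def call_with_fallback(ranked, call_model, log):
    """Try candidates in rank order and log every attempt."""
    for model in ranked:
        try:
            result = call_model(model)
        except ProviderError as err:
            if not is_retryable(err):
                raise
            log.append({"model": model, "status": "failed", "reason": err.kind})
            continue
        log.append({"model": model, "status": "ok"})
        return model, result
    raise RuntimeError("every candidate within caps failed")
```

The loop never leaves `ranked`, so a fallback can only land on a model that already passed the caps. Errors outside the retryable set (a 400, for example) surface to the caller unchanged.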
Run a real call. Get a real chain receipt.
Enter a prompt, set your caps, hit run. Mithril picks a model and returns the candidate set + the winner + the audit hash.
Every routing decision is a JSON record. Fetch any of them.
No black box. No "we'll show you in the next release." The candidate set, the caps, the scores, the choice — all returned for any inference id.
# fetch the routing decision for any inference
curl https://api.ainfera.ai/v1/inferences/inf_.../decision \
  -H "Authorization: Bearer $AINFERA_KEY"

# → returns:
{
  "inference_id": "inf_...",
  "task_type": "research",
  "policy_version": "<version>",
  "caps": {
    "budget": 0.012,
    "latency_ms": 1500,
    "quality": 0.90,
    "reliability": 0.9985
  },
  "candidates": [
    { "model": "claude-opus-4-7", "score": 0.942, "cost": 0.0061, "lat_ms": 940, "chosen": false },
    { "model": "gpt-5-5", "score": 0.931, "cost": 0.0058, "lat_ms": 870, "chosen": false },
    { "model": "gemini-3-1-pro", "score": 0.917, "cost": 0.0049, "lat_ms": 760, "chosen": true },
    { "model": "grok-4", "score": 0.902, "cost": 0.0042, "lat_ms": 1240, "chosen": false, "excluded": "reliability_below_floor" }
  ],
  "audit_hash": "0x...",
  "block": "block_height"
}
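Once you have a decision record, you can check it yourself. The sketch below re-derives the choice from the fields shown in the sample response. It assumes `score` is the quality value compared against `caps.quality`, and `check_decision` is a hypothetical helper, not part of any SDK.

```python
# Fields copied from the sample response; elided values are left out.
DECISION = {
    "caps": {"budget": 0.012, "latency_ms": 1500, "quality": 0.90, "reliability": 0.9985},
    "candidates": [
        {"model": "claude-opus-4-7", "score": 0.942, "cost": 0.0061, "lat_ms": 940, "chosen": False},
        {"model": "gpt-5-5", "score": 0.931, "cost": 0.0058, "lat_ms": 870, "chosen": False},
        {"model": "gemini-3-1-pro", "score": 0.917, "cost": 0.0049, "lat_ms": 760, "chosen": True},
        {"model": "grok-4", "score": 0.902, "cost": 0.0042, "lat_ms": 1240, "chosen": False,
         "excluded": "reliability_below_floor"},
    ],
}


def check_decision(decision):
    """Return a list of problems found in a decision record; empty means consistent."""
    caps = decision["caps"]
    chosen = [c for c in decision["candidates"] if c["chosen"]]
    if len(chosen) != 1:
        return ["expected exactly one chosen candidate"]
    winner = chosen[0]
    eligible = [c for c in decision["candidates"] if "excluded" not in c]
    problems = []
    if "excluded" in winner:
        problems.append("chosen candidate is marked excluded")
    if winner["cost"] > caps["budget"]:
        problems.append("chosen candidate exceeds the budget cap")
    if winner["lat_ms"] > caps["latency_ms"]:
        problems.append("chosen candidate exceeds the latency cap")
    if winner["score"] < caps["quality"]:
        problems.append("chosen candidate is below the quality floor")
    if any(c["cost"] < winner["cost"] for c in eligible):
        problems.append("a cheaper eligible candidate exists")
    return problems
```

For the sample record `check_decision(DECISION)` returns an empty list: the winner is inside every cap and no eligible candidate is cheaper.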