Ainfera
01 · Mithril · methodology v1.2

The best model per task, within your caps.

Constrained optimization. Filter to models that clear your quality floor under budget + latency caps. Pick the cheapest. Every decision on chain.

methodology v1.2 · v0 build (shipped)

Constrained optimization, not a scoring shortcut.

Most routers scalarize: w1·quality − w2·cost − w3·latency. That hides normalization in the weights, mixes incommensurable units, and can't reach non-convex regions of the cost–quality frontier. We solve the constrained form instead.
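A minimal sketch of the non-convexity point, using three made-up models (names and numbers are illustrative, not Ainfera data). Model B sits in a dent of the cost–quality frontier, so no weighting of quality against cost ever selects it, while a quality floor of 0.70 picks it directly.

```python
# Hypothetical models: (name, quality, cost). Numbers are illustrative only.
MODELS = [("A", 0.50, 1.0), ("B", 0.70, 5.0), ("C", 1.00, 10.0)]


def scalarized_pick(w):
    """Pick by the weighted sum w*quality - (1 - w)*cost."""
    return max(MODELS, key=lambda m: w * m[1] - (1 - w) * m[2])[0]


def constrained_pick(quality_floor):
    """Pick the cheapest model whose quality clears the floor."""
    eligible = [m for m in MODELS if m[1] >= quality_floor]
    return min(eligible, key=lambda m: m[2])[0] if eligible else None


# Sweep the weight across [0, 1]: B is never the winner.
scalarized_winners = {scalarized_pick(i / 1000) for i in range(1001)}
print(sorted(scalarized_winners))  # ['A', 'C']
print(constrained_pick(0.70))      # B
```

B is Pareto-optimal (better than A, cheaper than C) yet lies below the line joining A and C, which is why a linear score skips it.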

/// the objective · v0 (shipped today) · compliance veto runs before scoring · M_allowed
r(x, a) =  argmin         cost(m)
           m ∈ M_allowed(x, a),
           q_prior(m, task) ≥ floor(x, a)

  s.t.   Σ cost(workflow) + ĉ(m) ≤ budget_envelope(a)
         l̂(m, x)                ≤ latency_target(x)

  ties → higher q_prior → lower latency tier

// v0 ships q_prior only — anchored to current public benchmarks for the
// 5 frontier backends. Q(m,x,a) ≈ q_prior at v0; the remaining terms
// activate at v1+ once §16 history lets q_empirical train (below).
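The v0 objective above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Ainfera's implementation: the `Candidate` fields and the `route_v0` signature are hypothetical, and raw latency stands in for the "latency tier" used in tie-breaking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    q_prior: float         # prior quality for the task
    est_cost: float        # projected cost of this call, USD
    est_latency_ms: float  # projected latency for this request


def route_v0(allowed, floor, spent_in_workflow, budget_envelope, latency_target_ms):
    """Cheapest candidate that clears the floor and both constraints, or None."""
    feasible = [
        m for m in allowed
        if m.q_prior >= floor
        and spent_in_workflow + m.est_cost <= budget_envelope
        and m.est_latency_ms <= latency_target_ms
    ]
    if not feasible:
        return None  # nothing fits: report it rather than downgrade
    # Ties on cost go to higher q_prior, then to lower latency.
    return min(feasible, key=lambda m: (m.est_cost, -m.q_prior, m.est_latency_ms))


allowed = [
    Candidate("m1", 0.92, 0.004, 900),
    Candidate("m2", 0.95, 0.004, 1100),
    Candidate("m3", 0.97, 0.009, 700),
]
pick = route_v0(allowed, floor=0.90, spent_in_workflow=0.005,
                budget_envelope=0.012, latency_target_ms=1500)
print(pick.name)  # m2
```

Here m3 is dropped because the workflow has already spent $0.005 and its call would push the total past the envelope; m1 and m2 tie on cost, so the higher prior wins.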
/// where this generalizes · v1.x target · q_empirical compounds with the §16 history below
r(x, a) =  argmax         Q(m, x, a)
           m ∈ M_allowed(x, a)

  s.t.   Σ cost(workflow) + ĉ(m) ≤ budget_envelope(a)
         l̂(m, x)                ≤ latency_target(x)

Q(m, x, a) = q_prior(m, task)            // public seed (commodity)
           + q_empirical(m, x, a)       // moat — learned residual over q_prior
           + β·σ(m, x)                  // uncertainty / exploration (LinUCB; β=0 at v0)
           + γ·consistency(m, workflow) // workflow coherence
           − w_h·h(m, t)                // live provider health
           − w_r·ρ(m, x)                // residual risk

Payment, identity, and transport for agents are converging into open commodities. q_empirical is the only term that can't be reproduced by copying a spec — it's a learned residual over q_prior, trained on Ainfera's own routed-outcome records. Zero at launch; compounds with every settled §16 record below.
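The Q(m, x, a) decomposition above is a plain sum of terms, which a short sketch makes concrete. The `Terms` and `Weights` containers and their field names are hypothetical; the only behavior taken from the text is that with q_empirical at zero and the weights off, the score collapses to q_prior.

```python
from dataclasses import dataclass


@dataclass
class Terms:
    q_prior: float               # public seed
    q_empirical: float = 0.0     # learned residual, zero at launch
    sigma: float = 0.0           # uncertainty estimate
    consistency: float = 0.0     # workflow coherence
    health_penalty: float = 0.0  # live provider health term h(m, t)
    residual_risk: float = 0.0   # residual risk term


@dataclass
class Weights:
    beta: float = 0.0   # exploration weight, 0 at v0
    gamma: float = 0.0  # assumed off at v0 for this sketch
    w_h: float = 0.0    # assumed off at v0 for this sketch
    w_r: float = 0.0    # assumed off at v0 for this sketch


def q_score(t, w):
    """Sum the terms of Q with their signs as written in the formula."""
    return (t.q_prior
            + t.q_empirical
            + w.beta * t.sigma
            + w.gamma * t.consistency
            - w.w_h * t.health_penalty
            - w.w_r * t.residual_risk)


# With every weight at zero, an uncertain model still scores exactly its prior.
print(q_score(Terms(q_prior=0.917, sigma=0.3), Weights()))  # 0.917
```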

runtime · six stages

Hot path under 30 ms.

Control plane (policy compilation, learning, evaluation) is off the hot path. The data plane targets sub-30 ms per decision.

S1  cache check         exact-match default; semantic opt-in
S2  candidate set       compliance veto · budget gate → M_allowed
S3  score               Q(m, x, a), distilled predictor on hot path
S4  dispatch            selected provider call
S5  monitor + fallback  re-veto each hop; local model terminal
S6  emit                audit · outcome · reward (async)

§16 · outcome capture (locked, immutable)

Every routed call writes the same record.

The audit chain is append-only and hash-chained — schema cannot migrate after capture. The §16 record is what makes q_empirical trainable.

/// routing_outcomes · shipped today (v0) · append-only · one row per decision
{
  request_id, agent_id,
  candidates[], chosen_model,
  M_allowed_set,        // the post-veto candidate set
  q_prior_used,         // floor-clearing quality, per chosen model
  cost_projected, cost_actual_usd,
  latency_ms,
  outcome_status,       // ok | fallback | error
  seed,                 // deterministic-replay seed
  policy_version,       // {name}@{semver}
  ruleset_hash,         // hash of weights + candidate set — catches silent drift
  traffic_origin,       // fleet | external | test — dogfood-contamination guard
  fleet_agent           // nullable — per-agent dogfood analysis
}

The chain is append-only and hash-chained, so the schema is one-shot. Reserved for v1+ (empty at v0; switched on when the classifier and embedding store come online with LinUCB): task_type, task_type_source, query_embedding_ref, query_embedding_hash, embedding_model, cell. The query embedding itself lives in a durable feature store; only its hash + ref will enter the chain, so the privacy boundary holds and the training signal is preserved.
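To show why an append-only, hash-chained log resists silent edits, here is a minimal sketch. The source does not say which hash function or encoding the audit chain uses; SHA-256 over canonical JSON, the genesis value, and every function name here are assumptions for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for the first entry's predecessor


def record_hash(prev_hash, record):
    """SHA-256 over the previous hash plus a canonical JSON encoding of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(chain, record):
    """Add a record, linking it to the hash of the entry before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"record": record, "prev": prev, "hash": record_hash(prev, record)})


def verify(chain):
    """Recompute every link; any edited record breaks the chain from that point on."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != record_hash(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True


chain = []
append(chain, {"request_id": "req_1", "chosen_model": "gemini-3-1-pro"})
append(chain, {"request_id": "req_2", "chosen_model": "gpt-5-5"})
print(verify(chain))  # True

chain[0]["record"]["chosen_model"] = "grok-4"  # tamper with history
print(verify(chain))  # False
```

Because each hash covers the record's full encoding, adding or renaming a field after capture would change old hashes as well, which is one way to read "the schema is one-shot".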

a worked example

One request, four candidates, one decision.

A real call your research agent might make. All four candidates fit the budget and latency caps, but the reliability floor excludes the cheapest one. We pick the cheapest of what's left.

/// your agent's code → produces the routing decision below
# 1 SDK call. Caps in the body. We score candidates and pick.
res = client.chat.completions.create(
  model="ainfera-mithril",
  messages=[{"role": "user", "content": query}],
  extra_body={
    "caps": {
      "budget":      0.012,
      "latency_ms":  1500,
      "quality":     0.90,
      "reliability": 0.9985,
    },
  },
)

# res.routing.model           → "gemini-3-1-pro"
# res.routing.candidates      → 4 models, 1 excluded
# res.audit.id                → inf_...

request req_demo · task: research · agent: Varda
"Compare privacy trade-offs of federated learning vs centralized fine-tuning for medical LLMs. 3-paragraph technical response."

budget cap         $0.0120
latency cap        1,500 ms
quality floor      0.90
reliability floor  99.85%

Candidate        Quality  Cost     p50 latency  Reliability  Status
claude-opus-4-7  0.942    $0.0061  940 ms       99.94%       eligible · ranked 3
gpt-5-5          0.931    $0.0058  870 ms       99.91%       eligible · ranked 2
gemini-3-1-pro   0.917    $0.0049  760 ms       99.88%       ● chosen
grok-4           0.902    $0.0042  1,240 ms     99.71%       excluded · reliability

We choose gemini-3-1-pro because it is the cheapest model that clears your quality and reliability floors under your budget and latency caps. grok-4 is cheaper but excluded: its reliability of 99.71% sits below your floor of 99.85%. gpt-5-5 and claude-opus-4-7 are eligible too; they cost more, and form the fallback order if gemini-3-1-pro errors.
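The decision above can be reproduced from the table with a short script. The figures are the ones in the worked example; the function name and the order in which the caps are checked are assumptions made for this sketch.

```python
# Rows from the worked example: (model, quality, cost_usd, p50_ms, reliability)
CANDIDATES = [
    ("claude-opus-4-7", 0.942, 0.0061, 940, 0.9994),
    ("gpt-5-5",         0.931, 0.0058, 870, 0.9991),
    ("gemini-3-1-pro",  0.917, 0.0049, 760, 0.9988),
    ("grok-4",          0.902, 0.0042, 1240, 0.9971),
]
CAPS = {"budget": 0.012, "latency_ms": 1500, "quality": 0.90, "reliability": 0.9985}


def exclusion_reason(row, caps):
    """Return the first cap the row violates, or None if it is eligible."""
    _, quality, cost, p50_ms, reliability = row
    if cost > caps["budget"]:
        return "budget"
    if p50_ms > caps["latency_ms"]:
        return "latency"
    if quality < caps["quality"]:
        return "quality"
    if reliability < caps["reliability"]:
        return "reliability"
    return None


# Eligible rows sorted by cost: the first is chosen, the rest form the fallback order.
eligible = sorted(
    (r for r in CANDIDATES if exclusion_reason(r, CAPS) is None),
    key=lambda r: r[2],
)
ranking = [r[0] for r in eligible]
excluded = {r[0]: exclusion_reason(r, CAPS) for r in CANDIDATES
            if exclusion_reason(r, CAPS) is not None}

print(ranking)   # ['gemini-3-1-pro', 'gpt-5-5', 'claude-opus-4-7']
print(excluded)  # {'grok-4': 'reliability'}
```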
your hard limits

You set the box. We pick the model inside it.

Four caps, settable per agent or per task type. If no model fits, we tell you — we don't quietly downgrade.

01 / BUDGET

Per-call cost ceiling.

A hard cap in dollars per call, or in dollars per million tokens. Cheaper models stay eligible — expensive ones drop.

02 / LATENCY

p50 wall-clock ceiling.

Measured against rolling production traffic, not vendor-published numbers. Slow models drop, even when they're cheap.

03 / QUALITY FLOOR

Minimum measured quality.

A floor on real-world quality scores, measured per task type. We won't route below this no matter how cheap or fast.

04 / RELIABILITY FLOOR

Minimum 30-day success rate.

If a provider has been flaky this month, it's out. It comes back automatically once it has recovered for 24 hours.

when things break

Fallback is the second-best candidate, not panic.

If the chosen model returns a 429, 5xx, timeout, or refusal, we retry on the next-ranked candidate within your caps. Logged and audited, same as any other decision.

PRIMARY · failed   gemini-3-1-pro   ↻ 429 rate-limit at 180 ms
FALLBACK · ok      gpt-5-5          ✓ ok at 540 ms

Total wall time 720 ms · Fallback overhead 180 ms · Both the failure and the retry are audited on chain.
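A minimal sketch of the fallback behavior described above, assuming the candidates are already ranked and within caps. `ProviderError`, `call_with_fallback`, and the audit entry shape are hypothetical names for illustration; the per-hop re-veto and the local terminal model from stage S5 are not modeled.

```python
class ProviderError(Exception):
    """Stand-in for a 429, 5xx, timeout, or refusal from a provider."""


def call_with_fallback(ranked_models, call, audit):
    """Try candidates in rank order; audit every attempt, return the first success."""
    for model in ranked_models:
        try:
            result = call(model)
        except ProviderError as err:
            audit.append({"model": model, "status": "error", "reason": str(err)})
            continue
        audit.append({"model": model, "status": "ok"})
        return result
    return None  # every in-cap candidate failed


def fake_call(model):
    """Simulated provider: the primary is rate-limited, everything else answers."""
    if model == "gemini-3-1-pro":
        raise ProviderError("429 rate-limit")
    return f"response from {model}"


audit = []
result = call_with_fallback(
    ["gemini-3-1-pro", "gpt-5-5", "claude-opus-4-7"], fake_call, audit
)
print(result)                       # response from gpt-5-5
print([e["status"] for e in audit])  # ['error', 'ok']
```

The third candidate is never called: the loop stops at the first success, and both the failed and the successful attempt land in the audit list.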
try mithril

Run a real call. Get a real chain receipt.

Enter a prompt, set your caps, hit run. Mithril picks a model and returns the candidate set + the winner + the audit hash.

No fake numbers: the demo only works when the API endpoint is live.

decisions are data

Every routing decision is a JSON record. Fetch any of them.

No black box. No "we'll show you in the next release." The candidate set, the caps, the scores, the choice — all returned for any inference id.

curl · fetch routing decision
# fetch the routing decision for any inference
curl https://api.ainfera.ai/v1/inferences/inf_.../decision \
  -H "Authorization: Bearer $AINFERA_KEY"

# → returns:
{
  "inference_id":   "inf_...",
  "task_type":      "research",
  "policy_version": "<version>",
  "caps": { "budget": 0.012, "latency_ms": 1500, "quality": 0.90, "reliability": 0.9985 },
  "candidates": [
    { "model": "claude-opus-4-7",  "score": 0.942, "cost": 0.0061, "lat_ms":  940, "chosen": false },
    { "model": "gpt-5-5",          "score": 0.931, "cost": 0.0058, "lat_ms":  870, "chosen": false },
    { "model": "gemini-3-1-pro",   "score": 0.917, "cost": 0.0049, "lat_ms":  760, "chosen": true  },
    { "model": "grok-4",           "score": 0.902, "cost": 0.0042, "lat_ms": 1240, "chosen": false,
      "excluded": "reliability_below_floor" }
  ],
  "audit_hash":  "0x...",
  "block":       "block_height"
}
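The same lookup can be sketched in Python with the standard library. The URL shape and the response fields come from the curl example above; the helper names are hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request

API_BASE = "https://api.ainfera.ai/v1"


def decision_request(inference_id, api_key):
    """Build (but do not send) the GET request for one routing decision."""
    return urllib.request.Request(
        f"{API_BASE}/inferences/{inference_id}/decision",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def summarize(decision):
    """Return (chosen model, {excluded model: reason}) from a decision record."""
    chosen = next(c["model"] for c in decision["candidates"] if c["chosen"])
    excluded = {c["model"]: c["excluded"]
                for c in decision["candidates"] if "excluded" in c}
    return chosen, excluded


# To fetch for real: json.load(urllib.request.urlopen(decision_request(inference_id, key)))
# Below, a trimmed copy of the response shown above stands in for the network call.
sample = json.loads("""
{
  "candidates": [
    {"model": "claude-opus-4-7", "score": 0.942, "cost": 0.0061, "lat_ms": 940, "chosen": false},
    {"model": "gpt-5-5", "score": 0.931, "cost": 0.0058, "lat_ms": 870, "chosen": false},
    {"model": "gemini-3-1-pro", "score": 0.917, "cost": 0.0049, "lat_ms": 760, "chosen": true},
    {"model": "grok-4", "score": 0.902, "cost": 0.0042, "lat_ms": 1240, "chosen": false,
     "excluded": "reliability_below_floor"}
  ]
}
""")
print(summarize(sample))  # ('gemini-3-1-pro', {'grok-4': 'reliability_below_floor'})
```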
next

Routing is one thing. Proof is next →