Ainfera
How routing works

Pick the model by the result, not the reputation.

Every agent call is hard to place — capability, cost and latency trade off differently each time. Ainfera scores the candidates against the task and routes to the one most likely to finish it. Here's what goes into that, and the proof it leaves behind.

01 · Signals we weigh

Four inputs decide whether a call finishes.

These are the signals every candidate is scored on. How we weigh them is the part that compounds with traffic — so the weights stay ours — but the inputs are no secret.

Task type

What the call is

A drafting call and a tool-use call don't want the same model. We read the shape of the request first.

Cost

What it costs

Live per-token price for each candidate, against the ceiling you set.

Latency

How fast it answers

Measured on rolling production traffic, not vendor-published numbers.

Availability

Whether it's healthy now

A provider that's erroring or rate-limiting this minute drops out, and comes back when it recovers.
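A minimal sketch of how rolling latency and health could be tracked per candidate. It is not Ainfera's implementation: the `ProviderStats` class, the window size and the error-rate threshold are all assumptions made for illustration.

```python
from collections import deque
from statistics import median
from typing import Optional


class ProviderStats:
    """Rolling view of one candidate's recent calls (hypothetical helper)."""

    def __init__(self, window: int = 50, max_error_rate: float = 0.5):
        # Only the most recent `window` calls are kept; older ones fall off.
        self.samples = deque(maxlen=window)  # (latency_seconds, ok) pairs
        self.max_error_rate = max_error_rate  # assumed threshold

    def record(self, latency_s: float, ok: bool) -> None:
        self.samples.append((latency_s, ok))

    def latency(self) -> Optional[float]:
        """Median latency of recent successful calls, or None if there are none."""
        ok_latencies = [lat for lat, ok in self.samples if ok]
        return median(ok_latencies) if ok_latencies else None

    def healthy(self) -> bool:
        """False while recent calls are mostly failing; True again once they recover."""
        if not self.samples:
            return True  # no evidence against the provider yet
        errors = sum(1 for _, ok in self.samples if not ok)
        return errors / len(self.samples) <= self.max_error_rate
```

Because the window is bounded, a provider that starts failing drops out after a few calls and returns on its own once successful calls push the failures out of the window.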

  1. GPT-OSS 120B (Groq) · 257 tok/s
  2. GPT-OSS 20B (Novita) · 246 tok/s
  3. Qwen3.7 Max (Novita) · 200 tok/s
  4. Qwen3-Next-80B-A3B-Instruct · 190 tok/s
  5. GLM 5.2 · 181 tok/s
  6. Qwen3.5-35B-A3B · 159 tok/s
  7. Qwen3.5-122B-A10B · 142 tok/s
  8. Qwen3-VL-8B-Instruct · 141 tok/s
  9. Qwen3.6-35B-A3B · 140 tok/s
  10. GLM-4.7 · 117 tok/s
  11. Qwen3-VL-30B-A3B-Instruct · 116 tok/s
  12. Qwen3 Coder Next · 111 tok/s
Reference output speed (Artificial Analysis). We score live per-call latency on top of this — measured on production traffic, not published numbers.

Intelligence: Artificial Analysis · artificialanalysis.ai

02 · Outcome

We route to the model most likely to finish the task.

Not the biggest name, not a model you pinned six months ago and forgot. The pick is made per call and changes as price, speed and health change — so the cheapest model that still clears the bar is the one that runs.
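The rule "the cheapest model that still clears the bar" can be sketched as a filter followed by a minimum. This is an illustration under stated assumptions, not Ainfera's scoring: the `Candidate` fields, the `quality` estimate, the `pick` function and the model names are hypothetical, and the real weights are not public.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    cost_per_call: float  # estimated cost of this call
    latency_s: float      # recent measured latency, in seconds
    quality: float        # assumed 0..1 estimate of finishing this task type
    healthy: bool         # False while the provider is erroring or rate-limiting


class NoCandidateFits(Exception):
    """Raised instead of silently picking a model outside the caps."""


def pick(candidates: List[Candidate], max_cost: float,
         max_latency_s: float, min_quality: float) -> Candidate:
    # Keep only healthy candidates that sit inside the caps and clear the bar.
    eligible = [
        c for c in candidates
        if c.healthy
        and c.cost_per_call <= max_cost
        and c.latency_s <= max_latency_s
        and c.quality >= min_quality
    ]
    if not eligible:
        raise NoCandidateFits("no candidate fits the caps")
    # Among those that clear the bar, the cheapest one runs.
    return min(eligible, key=lambda c: c.cost_per_call)


pool = [
    Candidate("model-a", 0.010, 1.0, 0.90, True),
    Candidate("model-b", 0.002, 0.5, 0.60, True),   # cheap, but below the bar
    Candidate("model-c", 0.004, 0.8, 0.85, True),
    Candidate("model-d", 0.001, 0.3, 0.95, False),  # unhealthy right now
]
choice = pick(pool, max_cost=0.02, max_latency_s=2.0, min_quality=0.8)
print(choice.name)
```

Because the choice is recomputed on every call, a change in any candidate's price, latency or health changes the result without any configuration change.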

Figure: Intelligence Index (vertical axis) against output speed in tokens/sec (horizontal axis). Every model plotted is in the preferred core:

  GLM 5.2 · Z.ai (GLM) · index 51 · 181 tok/s
  Qwen3.7 Max (Novita) · Alibaba (Qwen) · index 46 · 200 tok/s
  DeepSeek V4 Pro (Together) · DeepSeek · index 44 · 65 tok/s
  MiniMax-M3 · MiniMax · index 44 · 98 tok/s
  DeepSeek-V4-Flash · DeepSeek · index 40 · 93 tok/s
  GLM 5.1 (Novita) · Z.ai (GLM) · index 40 · 67 tok/s
  GLM-5 · Z.ai (GLM) · index 40 · 61 tok/s
  Qwen3.6 Plus · Alibaba (Qwen) · index 40 · 52 tok/s
  Qwen3.7 Plus · Alibaba (Qwen) · index 39 · 52 tok/s
  MiniMax M2.7 (Together) · MiniMax · index 38 · 50 tok/s
  Qwen3.6-27B · Alibaba (Qwen) · index 37 · 59 tok/s
  GLM-4.7 · Z.ai (GLM) · index 34 · 117 tok/s
  MiniMax M2.5 · MiniMax · index 34 · 82 tok/s
  Qwen3.5 397B A17B (DeepInfra) · Alibaba (Qwen) · index 34 · 51 tok/s
  Qwen3.5 397B A17B (Together) · Alibaba (Qwen) · index 34 · 51 tok/s
  Qwen3.5-27B · Alibaba (Qwen) · index 34 · 83 tok/s
  Qwen3.5-122B-A10B · Alibaba (Qwen) · index 32 · 142 tok/s
  Qwen3.6-35B-A3B · Alibaba (Qwen) · index 32 · 140 tok/s
  Minimax M2.1 · MiniMax · index 31 · 80 tok/s
  Qwen3.5-35B-A3B · Alibaba (Qwen) · index 29 · 159 tok/s
  GPT-OSS 120B (Groq) · OpenAI · index 24 · 257 tok/s
  Qwen3-Max · Alibaba (Qwen) · index 24 · 61 tok/s
  GLM-4.6 · Z.ai (GLM) · index 23 · 41 tok/s
  GLM-4.7-Flash · Z.ai (GLM) · index 23 · 85 tok/s
  Qwen3 Coder Next · Alibaba (Qwen) · index 21 · 111 tok/s
  Qwen3.5 9B FP8 · Alibaba (Qwen) · index 21 · 48 tok/s
  Qwen3 235B A22B Instruct 2507 · Alibaba (Qwen) · index 18 · 63 tok/s
  Qwen3 Coder 480B A35B Instruct · Alibaba (Qwen) · index 18 · 69 tok/s
  zai-org/glm-4.5-air · Z.ai (GLM) · index 16 · 77 tok/s
  GPT-OSS 20B (Novita) · OpenAI · index 15 · 246 tok/s
  Qwen3 Coder 30b A3B Instruct · Alibaba (Qwen) · index 14 · 108 tok/s
  Qwen3-Next-80B-A3B-Instruct · Alibaba (Qwen) · index 14 · 190 tok/s
  Qwen3-VL-235B-A22B-Instruct · Alibaba (Qwen) · index 14 · 50 tok/s
  Qwen QwQ-32B · Alibaba (Qwen) · index 13 · 32 tok/s
  GLM 4.6V · Z.ai (GLM) · index 11 · 49 tok/s
  Qwen3-VL-32B-Instruct · Alibaba (Qwen) · index 11 · 75 tok/s
  DeepSeek R1 Distill LLama 70B · DeepSeek · index 10 · 27 tok/s
  Qwen3-VL-30B-A3B-Instruct · Alibaba (Qwen) · index 10 · 116 tok/s
  Qwen3-VL-8B-Instruct · Alibaba (Qwen) · index 8 · 141 tok/s
  GLM 4.5V · Z.ai (GLM) · index 7 · 43 tok/s
  Qwen3 Omni 30B A3B Instruct · Alibaba (Qwen) · index 5 · 104 tok/s
Faster isn’t smarter: we pick the point that finishes the task inside your caps. Speed is the Artificial Analysis reference; live per-call latency is scored on top of it.

Intelligence + speed: Artificial Analysis · artificialanalysis.ai

03 · Your controls

You set the box. We pick the model inside it.

Routing is yours to bound. Three controls, settable per agent or per task type.

Caps

Set the box

Per-call cost ceilings and latency targets, per agent or per task type. If nothing fits, we tell you — we never quietly downgrade.

Pins

Force a model

Pin a specific model or provider when you need it, and keep routing everywhere else.

Fallbacks

Stay up

On a 429, 5xx, timeout or refusal we retry the next eligible candidate inside your caps — logged and audited like any other call.
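The fallback behaviour can be sketched as a loop over the candidates that already fit the caps. This is a minimal illustration, not Ainfera's code: `ProviderError`, `call_with_fallback`, the `send` callback, the reason strings and the model names are all hypothetical.

```python
from typing import Callable, List, Tuple

# Failure kinds the page lists as triggering a fallback.
RETRYABLE = {"429", "5xx", "timeout", "refusal"}


class ProviderError(Exception):
    def __init__(self, reason: str):
        super().__init__(reason)
        self.reason = reason


def call_with_fallback(candidates: List[str],
                       send: Callable[[str], str],
                       audit_log: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Try each eligible candidate in order; log every attempt, failed or not."""
    for model in candidates:
        try:
            result = send(model)
        except ProviderError as err:
            audit_log.append((model, err.reason))
            if err.reason not in RETRYABLE:
                raise  # anything else is surfaced, not retried
            continue
        audit_log.append((model, "ok"))
        return model, result
    raise RuntimeError("every eligible candidate failed")


def fake_send(model: str) -> str:
    # Stand-in for a real provider call: the first two candidates fail.
    outcomes = {"model-a": "429", "model-b": "timeout"}
    if model in outcomes:
        raise ProviderError(outcomes[model])
    return f"answer from {model}"


log: List[Tuple[str, str]] = []
winner, answer = call_with_fallback(["model-a", "model-b", "model-c"], fake_send, log)
print(winner, log)
```

The candidate list is assumed to be pre-filtered by the caps, so a fallback can never land on a model outside them.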

04 · Proof

Every decision is signed, on a public chain.

No black box and no dashboard claim. Every routed call is hashed, Ed25519-signed, and added to an append-only chain. Verify any entry with a single request: no account, no key.

Trace · live audit (live)
time      hash · agent                 event · model          seq
13:22:15  0x91b0…55c6 · verify-gold    provider ok · openai   2,528
13:22:15  0x76dc…99cd · verify-gold    refunded               2,527
13:22:15  0xdb77…556f · verify-gold    created                2,529
13:22:08  0x1df7…7c83 · verify-gold    debited                2,525
13:22:08  0x8a3d…01b8 · verify-gold    routed · gpt-5-5       2,526
13:22:08  0x255e…a0df · verify-gold    request · gpt-5-5      2,524
verify — no key required
# the public chain is keyless
curl https://api.ainfera.ai/v1/audit/public

# → each entry: the routed model, provider,
#   sequence, block height and the Ed25519
#   signature. Re-hash it yourself to verify.
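
A hedged sketch of the "re-hash it yourself" step. The page does not specify the entry schema, the hash function or the serialization, so everything here is an assumption: the field names (`model`, `provider`, `seq`, `block`, `hash`), SHA-256, and sorted-key compact JSON as the canonical form. The sample entry is made up, not taken from the chain. Checking the Ed25519 signature additionally needs Ainfera's public key and an Ed25519 library, neither of which is shown here.

```python
import hashlib
import json


def entry_hash(entry: dict) -> str:
    """Recompute an entry's hash from its content (assumed scheme, see above)."""
    # Hash everything except the fields that are derived from the content.
    payload = {k: v for k, v in entry.items() if k not in ("hash", "signature")}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "0x" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def matches(entry: dict) -> bool:
    """True when the published hash equals the one recomputed locally."""
    return entry_hash(entry) == entry["hash"]


# Made-up entry for illustration only.
entry = {"model": "example-model", "provider": "example-provider", "seq": 1, "block": 1}
entry["hash"] = entry_hash(entry)

tampered = dict(entry, model="other-model")
print(matches(entry), matches(tampered))
```

If any field of a published entry is changed after the fact, the recomputed hash no longer matches the published one.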

Stop picking models. Start finishing tasks.

One endpoint, every provider, every decision on chain.

routing · active | block #55,167 | models · 241 | audit · on-chain
ainfera · the inference of ai agents