llm-rate

Friday, 26 June 2026

In the last 24 hours we dispatched 1,556 tasks across 4 models. Here's what we picked, and why.

What we ran

An autonomous AI fleet, written in TypeScript, picks a model per task using a complexity router. No vibes, no PR team. This is the actual production output of that router:

#   Model              Dispatches  Share   Why this one
01  claude-sonnet-4-6  1,066       68.5%   implementation (standard)
02  claude-haiku-4-5   278         17.9%   implementation (light)
03  gpt-5.4-mini       189         12.1%   implementation (codex pool)
04  claude-opus-4-6    23          1.5%    implementation (high complexity)

Window: 24h to 2026-05-16T00:00:00Z. Source: daemon routing logs. The router writes one decision per dispatch; we parsed 1,556 of them.
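A minimal TypeScript sketch of how a complexity router and the share tally could fit together. The `Task` shape, the tier names, and the 0.3 / 0.8 cut-offs are assumptions for illustration; only the model names and dispatch counts come from the table. How the production router actually scores complexity is not described here.

```typescript
// Hypothetical tiers. Model names are taken from the dispatch table.
type Tier = "light" | "standard" | "high" | "codex";

const MODEL_BY_TIER: Record<Tier, string> = {
  light: "claude-haiku-4-5",
  standard: "claude-sonnet-4-6",
  high: "claude-opus-4-6",
  codex: "gpt-5.4-mini",
};

// Assumed task shape: a complexity score in [0, 1] and an optional pool hint.
interface Task {
  complexity: number;
  pool?: "codex";
}

// Assumed cut-offs (0.3 and 0.8); the real router's thresholds are not published.
function route(task: Task): string {
  if (task.pool === "codex") return MODEL_BY_TIER.codex;
  if (task.complexity >= 0.8) return MODEL_BY_TIER.high;
  if (task.complexity >= 0.3) return MODEL_BY_TIER.standard;
  return MODEL_BY_TIER.light;
}

interface ShareRow {
  model: string;
  dispatches: number;
  share: number; // percent, rounded to one decimal place
}

// Turn per-model dispatch counts into rows sorted by count, largest first.
function tally(counts: Record<string, number>): ShareRow[] {
  const models = Object.keys(counts);
  let total = 0;
  for (const m of models) total += counts[m];
  return models
    .map((model) => ({
      model,
      dispatches: counts[model],
      share: Math.round((counts[model] / total) * 1000) / 10,
    }))
    .sort((a, b) => b.dispatches - a.dispatches);
}

// Counts copied from the table.
const rows = tally({
  "claude-sonnet-4-6": 1066,
  "claude-haiku-4-5": 278,
  "gpt-5.4-mini": 189,
  "claude-opus-4-6": 23,
});
console.log(rows);
```

Keeping the tier-to-model map separate from the scoring logic means a model swap is a one-line change, and the tally only ever needs the final model name from each logged decision.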

If you don't have a router, here are the picks per common task

Filtered from arena.ai's leaderboard and published API prices. The filter thresholds are listed under each tab, and they are arguable. Treat this as a starting shortlist, not a verdict.

Fast, high-volume conversations. Latency-sensitive. Margin matters.

Best value

qwen3-235b-a22b-thinking-2507

Alibaba · quality 1413.7 · $0.10/M blended

Best quality

qwen3.7-max-preview

Alibaba · quality 1475.0 · $3.00/M blended

Filter: blended price ≤ $5 per million tokens and Arena quality ≥ 1300. Ranked by value: most quality per dollar wins. 114 models survived.

#   Model                                Vendor    Quality  Ctx   In /1M  Out /1M  Value ↓   Pick
01  qwen3-235b-a22b-thinking-2507        Alibaba   1413.7   262k  $0.10   $0.10    413740.0  best value
02  gpt-oss-120b                         OpenAI    1365.5   131k  $0.03   $0.15    320596.5
03  gemma-3n-e4b-it                      Google    1306.2   33k   $0.06   $0.12    300196.1
04  deepseek-v4-flash                    DeepSeek  1430.8   1.0M  $0.09   $0.18    281581.7
05  gemma-3-12b-it                       Google    1334.2   131k  $0.05   $0.15    278500.0
06  gemma-3-27b-it                       Google    1358.2   131k  $0.08   $0.16    263404.4
07  qwen3-30b-a3b-instruct-2507          Alibaba   1383.8   131k  $0.05   $0.19    256598.5
08  nvidia-nemotron-3-nano-30b-a3b-bf16  Nvidia    1349.2   262k  $0.06   $0.24    187731.2
09  mimo-v2.5                            Xiaomi    1426.8   1.0M  $0.10   $0.28    187626.4
10  mimo-v2-flash (non-thinking)         Xiaomi    1411.4   262k  $0.10   $0.30    171400.0
11  step-3.5-flash                       StepFun   1404.3   262k  $0.09   $0.30    170573.8
12  mimo-v2-flash (thinking)             Xiaomi    1395.2   262k  $0.10   $0.30    164650.0
13  qwen3.7-max-preview                  Alibaba   1475.0   1.0M  $1.25   $3.75    15834.3   best quality
14  glm-5.1                              Z.ai      1468.3   203k  $1.40   $4.40    13379.7
15  gemini-3-flash                       Google    1466.2   1.0M  $0.50   $3.00    20720.4
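The filter and ranking can be sketched in TypeScript as follows. The two formulas are inferred from the published rows, not stated by the source: a blended price of 0.3 × input + 0.7 × output, and a value of (quality − 1000) / blended × 100. Together they match most rows to within rounding of the displayed inputs (for example gemma-3-12b-it: 334.2 / 0.12 × 100 = 278500.0). Treat both as assumptions; the `Model` type and the function names are illustrative.

```typescript
interface Model {
  name: string;
  quality: number; // Arena score
  inPerM: number; // USD per 1M input tokens
  outPerM: number; // USD per 1M output tokens
}

// Assumed 30/70 input/output weighting, inferred from the table.
function blendedPrice(m: Model): number {
  return 0.3 * m.inPerM + 0.7 * m.outPerM;
}

// Assumed value score: Arena points above 1000 per blended dollar, times 100.
function valueScore(m: Model): number {
  return ((m.quality - 1000) / blendedPrice(m)) * 100;
}

// Thresholds as stated in the filter line: blended <= $5/M and quality >= 1300.
// Survivors are sorted by value, highest first.
function shortlist(models: Model[]): Model[] {
  return models
    .filter((m) => blendedPrice(m) <= 5 && m.quality >= 1300)
    .sort((a, b) => valueScore(b) - valueScore(a));
}

// Three rows copied from the table.
const sample: Model[] = [
  { name: "qwen3.7-max-preview", quality: 1475.0, inPerM: 1.25, outPerM: 3.75 },
  { name: "gemma-3-12b-it", quality: 1334.2, inPerM: 0.05, outPerM: 0.15 },
  { name: "gemma-3n-e4b-it", quality: 1306.2, inPerM: 0.06, outPerM: 0.12 },
];

for (const m of shortlist(sample)) {
  console.log(m.name, blendedPrice(m).toFixed(2), valueScore(m).toFixed(1));
}
```

Under these assumptions qwen3.7-max-preview comes out at a blended $3.00/M, which agrees with the figure quoted for the best-quality pick.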

What this is, and isn't

Right now this is a filter over arena.ai's leaderboard plus a public log of what we ran. Arena Elo measures pairwise human preference on short prompts. It does not measure whether a model produces valid JSON under a schema, whether it hallucinates function names, whether it refuses queries it should answer, p99 latency, or rate-limit behaviour. Production teams need those signals.
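Three of those signals are cheap to check mechanically. The TypeScript sketch below is illustrative only and is not the benchmark runner; the flat `FieldSchema` format, the function names, and the nearest-rank percentile method are assumptions chosen to keep the example small.

```typescript
// Hypothetical flat schema: field name -> expected primitive type.
type FieldSchema = Record<string, "string" | "number" | "boolean">;

// True if `raw` parses as a JSON object with every field at its expected type.
function matchesSchema(raw: string, schema: FieldSchema): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (_err) {
    return false;
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return false;
  }
  const obj = parsed as Record<string, unknown>;
  return Object.keys(schema).every((key) => typeof obj[key] === schema[key]);
}

// Names the model called that are not in the declared tool list.
function hallucinatedCalls(called: string[], declared: string[]): string[] {
  return called.filter((name) => declared.indexOf(name) === -1);
}

// Nearest-rank percentile, e.g. percentile(latenciesMs, 99) for p99.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

console.log(matchesSchema('{"id": 7, "title": "ok"}', { id: "number", title: "string" }));
console.log(hallucinatedCalls(["search", "send_mail"], ["search"]));
```

A real suite would likely swap the flat schema for full JSON Schema validation and record latencies per model and per prompt, but the shape of the checks stays the same.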

We're building a benchmark runner: fixed prompt suites for RAG, structured extraction, code refactoring, and function calling, run daily against every model. Raw inputs, outputs, judge rationale, and costs will be published. When that lands, the "picks" section gets its real backing. Until then, it is opinion with a citation, not measurement.