Production teams are shifting their focus away from just scaling model size and toward building sophisticated control planes. While the emergence of "thinking" models has unlocked the ability to solve complex logic and coding tasks, it has also introduced a new set of failures: unpredictable costs, high latency, and "overthinking" trivial queries.
The emerging standard for production-grade systems is Routing Before Reasoning—an architecture that treats deep inference compute as a high-cost resource to be allocated only when strictly necessary.
1. The default failure mode is over-reasoning
The primary bottleneck in modern agentic systems is an inefficient distribution of intelligence. Many standard implementations apply a uniform reasoning strategy to every request, regardless of whether the user is asking for a complex code refactor or a simple factual confirmation. This "one-size-fits-all" approach leads to a cost-performance dilemma where simple queries consume excessive GPU resources while complex queries may still be under-served.
From a systems perspective, always-on reasoning creates significant technical debt. Long reasoning traces increase KV-cache memory pressure, which makes Time-to-First-Token (TTFT) and decode latency much harder to control in multi-tenant environments. Because every trace extends the active context, memory requirements grow rapidly, and long-running tasks often hit memory limits before they complete.
2. The first decision is which path should answer
The first layer of a production agent should be a cheap, high-speed routing decision. Instead of a model reflexively firing off tools or reasoning chains, a meta-reasoning layer assesses the intent of the query first. This design mirrors human metacognition by assessing a knowledge state before consulting external tools or starting a deep-thinking process.
By gatekeeping expensive collaborative reasoning, these systems ensure that only high-ambiguity or multi-step tasks escalate to the reasoning core. According to DOVA: Deliberation-First Multi-Agent Orchestration for Autonomous Research Automation, the explicit deliberation step can reduce total token consumption by 40–60% on simple workloads without any loss in answer quality.
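The deliberation-first gate can be sketched as a cheap pre-check that only escalates ambiguous or grounding-dependent queries. The scoring inputs and thresholds below are illustrative placeholders; a production router would derive them from a small classifier model or logprob-based uncertainty rather than hand-tuned constants.

```python
from dataclasses import dataclass

@dataclass
class RouteDecision:
    path: str    # "deterministic" | "retrieval" | "reasoning"
    reason: str

def route(query: str, ambiguity_score: float, needs_fresh_facts: bool) -> RouteDecision:
    """Gate expensive reasoning behind a cheap deliberation step."""
    if ambiguity_score < 0.2 and not needs_fresh_facts:
        return RouteDecision("deterministic", "low ambiguity, answer directly")
    if needs_fresh_facts and ambiguity_score < 0.5:
        return RouteDecision("retrieval", "grounding needed, no deep planning")
    return RouteDecision("reasoning", "high ambiguity or multi-step task")

print(route("What is our refund window?", 0.1, False).path)      # deterministic
print(route("Summarize this week's incidents", 0.3, True).path)  # retrieval
print(route("Refactor the billing module", 0.8, False).path)     # reasoning
```

The key design property is that the gate itself is trivially cheap: it must never cost more than the reasoning it is meant to avoid.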
3. Route by ambiguity, risk, cost, and state
To move beyond simple keyword-based routing, production controllers evaluate requests across four primary dimensions:
| Signal | Low | Medium | High |
| --- | --- | --- | --- |
| Ambiguity | Deterministic path | Retrieval (RAG) | Planner/Reasoning |
| Risk | Auto-execute | Prefill & Verify | Human Approval |
| Cost Budget | Small model (SLM) | Mid-tier model | Reasoning model |
| State Dependency | Stateless | Session-aware | Workflow-aware |
Advanced routers use "cache-aware" logic to identify when a request overlaps with context already active in the GPU's memory. Smart routers can use Radix Trees to track KV-cache blocks across a cluster, routing queries to specific instances to avoid costly recomputation. Furthermore, for high-risk or irreversible actions, the architecture must explicitly trigger human-in-the-loop gates rather than relying on model-led judgment.
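Cache-aware dispatch reduces to a prefix-matching problem: send the request to the worker whose cached context shares the longest token prefix with it. The sketch below flattens the radix-tree idea into a dict of cached prefixes per worker; real schedulers track KV-cache blocks at much finer granularity, so treat this as a toy model of the matching logic only.

```python
def shared_prefix_len(a: list, b: list) -> int:
    """Length of the common leading run of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_worker(request_tokens, worker_caches):
    """worker_caches maps worker_id -> list of cached token prefixes.
    Returns the worker with the longest cache overlap, so its KV cache
    for that prefix can be reused instead of recomputed."""
    best, best_len = None, -1
    for worker, prefixes in worker_caches.items():
        overlap = max((shared_prefix_len(request_tokens, p) for p in prefixes), default=0)
        if overlap > best_len:
            best, best_len = worker, overlap
    return best, best_len

caches = {
    "gpu-0": [["system", "you", "are", "a", "support", "agent"]],
    "gpu-1": [["system", "you", "are", "a", "coding", "agent"]],
}
req = ["system", "you", "are", "a", "coding", "agent", "fix", "this"]
print(pick_worker(req, caches))  # ('gpu-1', 6)
```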
4. Use cascades to match compute to task difficulty
Rather than selecting a single model for a project, engineers are building model cascades that sequentially attempt to solve a task using the cheapest possible path. If a small model’s response is deemed insufficient—based on uncertainty quantification or confidence scores—the system escalates to a more capable reasoner.
There are two primary ways to implement this routing logic today:
- Semantic Thinking Levels: New APIs allow developers to set "Thinking Levels" (e.g., Minimal, Low, Medium, High) on the same endpoint. This allows a single model to act as its own router, handling high-volume classification at a "Minimal" level and escalating to "High" only for complex hierarchical logic.
- Modular Adapters: On resource-constrained hardware, lightweight LoRA adapters can toggle "reasoning mode" on or off. This allows a fast base model to handle the majority of traffic, only activating reasoning weights when a switcher module detects a complex query.
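A cascade can be expressed as a cheapest-first loop with a confidence bar. The models here are stubs returning `(answer, confidence)` pairs; in practice the confidence signal would come from token logprobs, self-reported certainty, or a separate verifier model, and the threshold would be tuned per task class.

```python
def cascade(query, models, threshold=0.8):
    """Try models cheapest-first; escalate until confidence clears the bar.
    models is an ordered list of (name, callable) pairs."""
    for name, model in models:
        answer, confidence = model(query)
        if confidence >= threshold:
            return name, answer
    # Exhausted the ladder: return the strongest model's answer anyway.
    return name, answer

# Stub models for illustration only.
def small_model(q):
    return ("uncertain", 0.5) if "refactor" in q else ("42", 0.95)

def reasoning_model(q):
    return ("detailed refactor plan", 0.9)

models = [("slm", small_model), ("reasoner", reasoning_model)]
print(cascade("what is 6*7", models))          # ('slm', '42')
print(cascade("refactor the module", models))  # ('reasoner', 'detailed refactor plan')
```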
5. Gate tools separately from responses
Enterprise-grade governance requires that tool access be gated separately from response generation. A production agent should not have broad, loosely defined access to internal systems.
The router determines which specific tools are required and ensures they are schema-constrained and explicitly allowlisted. This separates "reading" tools from "writing" tools, requiring higher levels of certainty—and often a separate routing path for human approval—for actions that update production records or trigger external workflows.
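The read/write split can be enforced with an explicit allowlist and a stricter rule for mutating tools. The tool names and the approval condition below are hypothetical; the structural point is that anything not allowlisted is denied by default, and write-capable tools require both high certainty and an explicit human gate.

```python
READ_TOOLS = {"search_docs", "get_ticket"}       # low-risk, read-only
WRITE_TOOLS = {"update_ticket", "trigger_refund"}  # mutate production state

def authorize_tool(tool: str, confidence: float, human_approved: bool) -> bool:
    """Gate tool access separately from response generation."""
    if tool in READ_TOOLS:
        return True  # reads are auto-allowed
    if tool in WRITE_TOOLS:
        # writes require high certainty AND an explicit approval gate
        return confidence >= 0.9 and human_approved
    return False     # deny-by-default for anything not allowlisted

print(authorize_tool("search_docs", 0.5, False))      # True
print(authorize_tool("trigger_refund", 0.95, False))  # False
print(authorize_tool("trigger_refund", 0.95, True))   # True
print(authorize_tool("delete_everything", 1.0, True)) # False
```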
6. Evaluate the router, not just the model
Testing an agent's final answer is insufficient if the system is burning 10,000 tokens on a 10-token task. To maintain a production system, you must evaluate the quality of the control layer independently of the model's intelligence. Key performance indicators for the routing layer include:
- Routing Accuracy: The percentage of queries sent to the optimal (cheapest sufficient) model.
- Unnecessary Reasoning Rate: How often a high-compute path was selected for a task solvable by a "Minimal" thinking level.
- Escalation Precision/Recall: Does the router correctly identify when to defer to a stronger model or a human?
- Cost per Resolved Task: Measured against a single-model baseline to prove the ROI of the routing layer.
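These KPIs fall out directly from routing logs, provided each entry records which path was chosen and which cheapest path would have sufficed (typically determined by offline replay against the smaller models). The log schema below is an assumption for illustration.

```python
def router_metrics(log):
    """Compute routing KPIs from a list of log entries.
    Each entry: {"chosen": path, "cheapest_sufficient": path, "tokens": int}."""
    n = len(log)
    optimal = sum(1 for e in log if e["chosen"] == e["cheapest_sufficient"])
    over = sum(1 for e in log
               if e["chosen"] == "reasoning" and e["cheapest_sufficient"] != "reasoning")
    return {
        "routing_accuracy": optimal / n,
        "unnecessary_reasoning_rate": over / n,
        "avg_tokens_per_task": sum(e["tokens"] for e in log) / n,
    }

log = [
    {"chosen": "slm", "cheapest_sufficient": "slm", "tokens": 120},
    {"chosen": "reasoning", "cheapest_sufficient": "slm", "tokens": 9000},
    {"chosen": "reasoning", "cheapest_sufficient": "reasoning", "tokens": 7000},
    {"chosen": "slm", "cheapest_sufficient": "slm", "tokens": 90},
]
m = router_metrics(log)
print(m["routing_accuracy"])            # 0.75
print(m["unnecessary_reasoning_rate"])  # 0.25
```

Comparing `avg_tokens_per_task` against a single-model baseline run gives the cost-per-resolved-task ROI figure directly.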
Closing: Production Agents are Control Systems
A production agent should not route every request to the most expensive reasoning path. It should choose the cheapest safe path that can complete the workflow with acceptable quality. This makes the routing layer more than just an optimization; it is the control plane for latency, cost, reliability, and autonomy.
The real engineering value is in the controller that decides when to spend reasoning compute, when to use deterministic execution, and when to require human approval.
- Author: Fan Luo
- URL: https://fanluo.me/article/routing-before-reasoning
- Copyright: All articles in this blog adopt the BY-NC-SA agreement. Please indicate the source!
