Design Agents Around Workflows, Not Chat Turns

Feb 20, 2026 · 6 min read
In production systems, the failure mode is usually not that the model cannot answer. It is that the system does not know which workflow state it is in. We have reached a point where treating LLMs as simple conversational partners is a bottleneck for reliability. A chatbot answers a turn, but a production agent advances a workflow.
To build systems that scale, engineers are moving past interactional novelties toward process-first architectures. Here is the framework for moving from conversation to execution.

The wrong abstraction is the chat turn

Chat is a convenient interface, but a dangerous design unit. In many systems, interaction is treated as a simple exchange of messages, but this leads to "human glue"—users manually restating intent and tracking dependencies because the system has no structural awareness. Most conversational history eventually collapses into a linear trace where critical context—goals, constraints, and dependencies—is lost.
When everything is just a "turn," the system becomes reactive. Success requires an explicit, inspectable representation of the activity itself rather than a sequence of execution traces.

The right abstraction is the workflow state

The unit of design for production agents is the workflow state. Instead of generating text, the agent’s objective is to push a process through a transition—for instance, identifying a supply issue and moving it to "remediated". By elevating "Process" to a first-class concern, the system gains visibility into its own task structure.
This representational foundation allows for "Structural Adaptation," where the system can reorganize its own steps as goals or constraints change. As noted in The Enterprise AI Playbook, a critical success factor is fixing the underlying process before applying AI, mapping the workflow to identify real pain points.
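A minimal sketch of what "workflow state as a first-class concern" can look like in code: an explicit state enum plus a transition table the agent must move through, rather than free-form text. The states and transitions below are invented for illustration (the article's "remediated" example, plus assumed intermediate states).

```python
from enum import Enum, auto

class SupplyState(Enum):
    DETECTED = auto()
    TRIAGED = auto()
    REMEDIATED = auto()

# Allowed transitions make the process inspectable: the agent's job is to
# produce evidence that justifies one of these moves, not to generate text.
TRANSITIONS = {
    SupplyState.DETECTED: {SupplyState.TRIAGED},
    SupplyState.TRIAGED: {SupplyState.REMEDIATED},
    SupplyState.REMEDIATED: set(),
}

def advance(current: SupplyState, proposed: SupplyState) -> SupplyState:
    """Accept an agent-proposed transition only if the workflow allows it."""
    if proposed not in TRANSITIONS[current]:
        raise ValueError(f"Illegal transition {current.name} -> {proposed.name}")
    return proposed
```

Because the transition table is data, "Structural Adaptation" becomes an edit to that table rather than a rewrite of the agent.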

Route before reasoning

Reasoning is expensive. Use it where ambiguity exists, not where a deterministic path is available. Effective system design uses a routing layer to dispatch requests to the most efficient handler before engaging an LLM's full reasoning capabilities.
| Route Type | Mechanism | Latency/Cost | Use Case |
| --- | --- | --- | --- |
| Direct | Keyword / Regex | Lowest | Navigating to known UI states or help docs. |
| Retrieval | Semantic Router | Low | Fetching grounding data from a specific KB. |
| Execution | Mediator Pattern | Medium | Triggering a pre-designed JSON workflow. |
| Reasoning | LLM Planner | High | Solving novel, high-ambiguity multi-step tasks. |
According to The Enterprise AI Playbook, for 42% of implementations, model choice is a "commodity" (fully interchangeable), while the orchestration of these routes provides the actual competitive advantage.
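The routing tiers above can be sketched as a cascade of cheap checks, falling through to the LLM planner only when nothing simpler matches. The patterns and route names here are illustrative stand-ins, not a real production router:

```python
import re

def route(query: str) -> str:
    """Cheapest handler first; LLM reasoning is the fallback, not the default."""
    q = query.lower()
    # Direct: keyword/regex for known UI states ("settings" is a made-up example)
    if re.search(r"\b(open|go to|show)\b.*\bsettings\b", q):
        return "direct"
    # Retrieval: factual lookups grounded in a knowledge base
    if q.startswith(("what is", "where is", "who is")):
        return "retrieval"
    # Execution: trigger a pre-designed workflow blueprint
    if q.startswith(("run", "trigger", "execute")):
        return "execution"
    # Reasoning: novel, ambiguous requests escalate to the LLM planner
    return "reasoning"
```

In practice the retrieval tier would use a semantic router (embedding similarity) rather than string prefixes; the point is the ordering, not the matching logic.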

Separate planning from execution

One of the most impactful architectural shifts is the "Mediator" pattern, which decouples deciding what to do from actually carrying it out. For repeatable workflows, an LLM is used at "design time" to produce a declarative workflow blueprint—a structured JSON document specifying tool calls and data piping.
Once the blueprint is generated, a deterministic engine handles the mechanical execution (API calls, retries). According to A Workflow Engine for the Model Context Protocol, this reduces per-execution token costs by over 99% because the LLM is removed from the loop during repeated runs. Runtime reasoning is then reserved only for handling exceptions and dynamic replanning.
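A toy version of the pattern, with an invented blueprint schema (step ids, `$ref` piping syntax, and tool names are all assumptions for this sketch): the LLM emits the JSON once, and a small deterministic engine replays it with no model in the loop.

```python
import json

# A blueprint an LLM might emit once at design time (schema is illustrative).
BLUEPRINT = json.loads("""
{
  "steps": [
    {"id": "fetch", "tool": "get_order", "args": {"order_id": "$input"}},
    {"id": "check", "tool": "check_stock", "args": {"sku": "$fetch.sku"}}
  ]
}
""")

# Stand-in tool implementations; real ones would be API calls with retries.
TOOLS = {
    "get_order": lambda order_id: {"sku": "ABC-123"},
    "check_stock": lambda sku: {"in_stock": True},
}

def run(blueprint: dict, user_input: str) -> dict:
    """Deterministic engine: resolve $refs, call tools in order. No LLM."""
    results: dict = {"input": user_input}
    for step in blueprint["steps"]:
        args = {}
        for key, value in step["args"].items():
            if isinstance(value, str) and value.startswith("$"):
                ref, _, field = value[1:].partition(".")
                args[key] = results[ref][field] if field else results[ref]
            else:
                args[key] = value
        results[step["id"]] = TOOLS[step["tool"]](**args)
    return results
```

The engine only fails fast and surfaces exceptions; deciding what to do about a failure is the part handed back to the LLM for replanning.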

Evaluate trajectories, not responses

You cannot evaluate an agent the way you evaluate a static classifier. Success is determined by a trajectory: a sequence of states, actions, and observations.
As noted in Explainability in Traditional and Agentic AI Systems, failed runs are 2.7x more likely to exhibit "State Tracking Inconsistency" than final-output errors. Evaluation must focus on trajectory-level health, such as tool-choice accuracy and plan adherence, rather than just the final text response.
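A sketch of what trajectory-level scoring can look like: compare the agent's tool-call trace against a reference plan and report process metrics instead of grading only the final text. The metric names and fields are assumptions for illustration.

```python
def trajectory_scores(trace: list[str], reference: list[str]) -> dict:
    """Score a run by its sequence of actions, not its final response."""
    correct = sum(1 for got, want in zip(trace, reference) if got == want)
    return {
        # Fraction of reference steps where the right tool was chosen
        "tool_choice_accuracy": correct / max(len(reference), 1),
        # Did the run follow the plan exactly?
        "plan_adherence": trace == reference,
        # Wasted actions beyond the reference plan
        "extra_steps": max(len(trace) - len(reference), 0),
    }
```

A run can produce a plausible final answer while scoring poorly here, which is exactly the state-tracking failure mode a response-only evaluation misses.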

Human feedback is production data

Human-in-the-loop (HITL) is not a fallback; it is where production systems collect the labels that make the next run better. According to The Enterprise AI Playbook, the highest median productivity gains (71%) occur in "Escalation" models, where AI handles 80% of the work and routes only exceptions to humans.
When a user corrects an agent’s output (an "AI Override"), that "diff" is captured as a high-quality training label. Data quality improves when feedback is captured at the point of workflow correction, making the learning loop inherent in usage.
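Capturing that diff can be mechanically simple. A sketch, with an invented label schema, that records the human correction as structured training data at the moment of override:

```python
import difflib
import json
from datetime import datetime, timezone

def capture_override(run_id: str, ai_output: str, human_output: str) -> str:
    """Persist an AI Override as a training label (schema is illustrative)."""
    diff = list(difflib.unified_diff(
        ai_output.splitlines(), human_output.splitlines(), lineterm=""))
    return json.dumps({
        "run_id": run_id,
        "label_type": "ai_override",
        "diff": diff,                 # what the human changed, and where
        "final": human_output,        # the accepted ground truth
        "captured_at": datetime.now(timezone.utc).isoformat(),
    })
```

Because the label is keyed to the run, it can later be joined to the full trajectory, turning one correction into supervision for every step that led to it.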

Memory should be scoped to workflow state

Memory should not be a monolithic dump of all past interactions; this causes "context explosion" and performance deterioration. In production, memory must be modular and task-specific.
According to Optimizing FaaS Platforms for MCP-enabled Agentic Workflows, memory should be persisted externally (e.g., in DynamoDB) and injected selectively. Injecting specific state memory helps avoid redundant tool calls, reducing input tokens by approximately 85% in complex sessions.
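A minimal sketch of state-scoped memory: facts are partitioned by workflow and state, and only the relevant slice is injected into the prompt. The in-process dict is a stand-in for an external store such as DynamoDB; the key scheme is an assumption.

```python
# Memory partitioned by (workflow_id, state) instead of one monolithic history.
MEMORY: dict[tuple[str, str], dict] = {}

def remember(workflow_id: str, state: str, facts: dict) -> None:
    """Persist facts under the workflow state they belong to."""
    MEMORY.setdefault((workflow_id, state), {}).update(facts)

def context_for(workflow_id: str, state: str) -> dict:
    """Inject only the slice relevant to the current state; nothing else."""
    return dict(MEMORY.get((workflow_id, state), {}))
```

Facts cached this way (e.g. a SKU already looked up during triage) do not need to be re-fetched by a tool call, which is where the token and latency savings come from.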
 

Closing

A production agent should not depend on the model reconstructing the workflow from chat history every time. The workflow state should be explicit, inspectable, and evaluable. That is the difference between a helpful assistant and a system you can operate.