Design Agents Around Workflows, Not Chat Turns

Feb 20, 2026 · 6 min read
In production systems, the failure mode is usually not that the model cannot answer. It is that the system does not know which workflow state it is in. We have reached a point where treating LLMs as simple conversational partners is a bottleneck for reliability. A chatbot answers a turn, but a production agent advances a workflow.
To build systems that scale, engineers are moving past interactional novelties toward process-first architectures. Here is the framework for moving from conversation to execution.

The wrong abstraction is the chat turn

Chat is a convenient interface, but a dangerous design unit. In many systems, interaction is treated as a simple exchange of messages, but this leads to "human glue"—users manually restating intent and tracking dependencies because the system has no structural awareness. Most conversational history eventually collapses into a linear trace where critical context—goals, constraints, and dependencies—is lost.
When everything is just a "turn," the system becomes reactive. Success requires an explicit, inspectable representation of the activity itself rather than a sequence of execution traces.

The right abstraction is the workflow state

The unit of design for production agents is the workflow state. Instead of generating text, the agent’s objective is to push a process through a transition—for instance, identifying a supply issue and moving it to "remediated". By elevating "Process" to a first-class concern, the system gains visibility into its own task structure.
This representational foundation allows for "Structural Adaptation," where the system can reorganize its own steps as goals or constraints change. As noted in The Enterprise AI Playbook, a critical success factor is fixing the underlying process before applying AI, mapping the workflow to identify real pain points.
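A minimal sketch of what "workflow state as a first-class concern" can look like in code: an explicit state enum plus a transition table the agent must move through, rather than free-form text. The states and transitions below are invented for illustration (the article's "remediated" example, plus assumed intermediate states).

```python
from enum import Enum, auto

class SupplyState(Enum):
    DETECTED = auto()
    TRIAGED = auto()
    REMEDIATED = auto()

# Allowed transitions make the process inspectable: the agent's job is to
# produce evidence that justifies one of these moves, not to generate text.
TRANSITIONS = {
    SupplyState.DETECTED: {SupplyState.TRIAGED},
    SupplyState.TRIAGED: {SupplyState.REMEDIATED},
    SupplyState.REMEDIATED: set(),
}

def advance(current: SupplyState, proposed: SupplyState) -> SupplyState:
    """Accept an agent-proposed transition only if the workflow allows it."""
    if proposed not in TRANSITIONS[current]:
        raise ValueError(f"Illegal transition {current.name} -> {proposed.name}")
    return proposed
```

Because the transition table is data, "Structural Adaptation" becomes an edit to that table rather than a rewrite of the agent.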

Route before reasoning

Reasoning is expensive. Use it where ambiguity exists, not where a deterministic path is available. Effective system design uses a routing layer to dispatch requests to the most efficient handler before engaging an LLM's full reasoning capabilities.
| Route Type | Mechanism | Latency/Cost | Use Case |
| --- | --- | --- | --- |
| Direct | Keyword / Regex | Lowest | Navigating to known UI states or help docs. |
| Retrieval | Semantic Router | Low | Fetching grounding data from a specific KB. |
| Execution | Mediator Pattern | Medium | Triggering a pre-designed JSON workflow. |
| Reasoning | LLM Planner | High | Solving novel, high-ambiguity multi-step tasks. |
According to The Enterprise AI Playbook, for 42% of implementations, model choice is a "commodity" (fully interchangeable), while the orchestration of these routes provides the actual competitive advantage.
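The routing tiers above can be sketched as a cascade of cheap checks, falling through to the LLM planner only when nothing simpler matches. The patterns and route names here are illustrative stand-ins, not a real production router:

```python
import re

def route(query: str) -> str:
    """Cheapest handler first; LLM reasoning is the fallback, not the default."""
    q = query.lower()
    # Direct: keyword/regex for known UI states ("settings" is a made-up example)
    if re.search(r"\b(open|go to|show)\b.*\bsettings\b", q):
        return "direct"
    # Retrieval: factual lookups grounded in a knowledge base
    if q.startswith(("what is", "where is", "who is")):
        return "retrieval"
    # Execution: trigger a pre-designed workflow blueprint
    if q.startswith(("run", "trigger", "execute")):
        return "execution"
    # Reasoning: novel, ambiguous requests escalate to the LLM planner
    return "reasoning"
```

In practice the retrieval tier would use a semantic router (embedding similarity) rather than string prefixes; the point is the ordering, not the matching logic.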

Separate planning from execution

One of the most impactful architectural shifts is the "Mediator" pattern, which decouples deciding what to do from actually carrying it out. For repeatable workflows, an LLM is used at "design time" to produce a declarative workflow blueprint—a structured JSON document specifying tool calls and data piping.
Once the blueprint is generated, a deterministic engine handles the mechanical execution (API calls, retries). According to A Workflow Engine for the Model Context Protocol, this reduces per-execution token costs by over 99% because the LLM is removed from the loop during repeated runs. Runtime reasoning is then reserved only for handling exceptions and dynamic replanning.
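A toy version of the pattern, with an invented blueprint schema (step ids, `$ref` piping syntax, and tool names are all assumptions for this sketch): the LLM emits the JSON once, and a small deterministic engine replays it with no model in the loop.

```python
import json

# A blueprint an LLM might emit once at design time (schema is illustrative).
BLUEPRINT = json.loads("""
{
  "steps": [
    {"id": "fetch", "tool": "get_order", "args": {"order_id": "$input"}},
    {"id": "check", "tool": "check_stock", "args": {"sku": "$fetch.sku"}}
  ]
}
""")

# Stand-in tool implementations; real ones would be API calls with retries.
TOOLS = {
    "get_order": lambda order_id: {"sku": "ABC-123"},
    "check_stock": lambda sku: {"in_stock": True},
}

def run(blueprint: dict, user_input: str) -> dict:
    """Deterministic engine: resolve $refs, call tools in order. No LLM."""
    results: dict = {"input": user_input}
    for step in blueprint["steps"]:
        args = {}
        for key, value in step["args"].items():
            if isinstance(value, str) and value.startswith("$"):
                ref, _, field = value[1:].partition(".")
                args[key] = results[ref][field] if field else results[ref]
            else:
                args[key] = value
        results[step["id"]] = TOOLS[step["tool"]](**args)
    return results
```

The engine only fails fast and surfaces exceptions; deciding what to do about a failure is the part handed back to the LLM for replanning.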

Evaluate trajectories, not responses

You cannot evaluate an agent the way you evaluate a static classifier. Success is determined by a trajectory: a sequence of states, actions, and observations.
As noted in Explainability in Traditional and Agentic AI Systems, failed runs are 2.7x more likely to exhibit "State Tracking Inconsistency" than final-output errors. Evaluation must focus on trajectory-level health, such as tool-choice accuracy and plan adherence, rather than just the final text response.
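A sketch of what trajectory-level scoring can look like: compare the agent's tool-call trace against a reference plan and report process metrics instead of grading only the final text. The metric names and fields are assumptions for illustration.

```python
def trajectory_scores(trace: list[str], reference: list[str]) -> dict:
    """Score a run by its sequence of actions, not its final response."""
    correct = sum(1 for got, want in zip(trace, reference) if got == want)
    return {
        # Fraction of reference steps where the right tool was chosen
        "tool_choice_accuracy": correct / max(len(reference), 1),
        # Did the run follow the plan exactly?
        "plan_adherence": trace == reference,
        # Wasted actions beyond the reference plan
        "extra_steps": max(len(trace) - len(reference), 0),
    }
```

A run can produce a plausible final answer while scoring poorly here, which is exactly the state-tracking failure mode a response-only evaluation misses.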

Human feedback is production data

Human-in-the-loop (HITL) is not a fallback; it is where production systems collect the labels that make the next run better. According to The Enterprise AI Playbook, the highest median productivity gains (71%) occur in "Escalation" models, where AI handles 80% of the work and routes only exceptions to humans.
When a user corrects an agent’s output (an "AI Override"), that "diff" is captured as a high-quality training label. Data quality improves when feedback is captured at the point of workflow correction, making the learning loop inherent in usage.
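Capturing that diff can be mechanically simple. A sketch, with an invented label schema, that records the human correction as structured training data at the moment of override:

```python
import difflib
import json
from datetime import datetime, timezone

def capture_override(run_id: str, ai_output: str, human_output: str) -> str:
    """Persist an AI Override as a training label (schema is illustrative)."""
    diff = list(difflib.unified_diff(
        ai_output.splitlines(), human_output.splitlines(), lineterm=""))
    return json.dumps({
        "run_id": run_id,
        "label_type": "ai_override",
        "diff": diff,                 # what the human changed, and where
        "final": human_output,        # the accepted ground truth
        "captured_at": datetime.now(timezone.utc).isoformat(),
    })
```

Because the label is keyed to the run, it can later be joined to the full trajectory, turning one correction into supervision for every step that led to it.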

Memory should be scoped to workflow state

Memory should not be a monolithic dump of all past interactions; this causes "context explosion" and performance deterioration. In production, memory must be modular and task-specific.
According to Optimizing FaaS Platforms for MCP-enabled Agentic Workflows, memory should be persisted externally (e.g., in DynamoDB) and injected selectively. Injecting specific state memory helps avoid redundant tool calls, reducing input tokens by approximately 85% in complex sessions.
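A minimal sketch of state-scoped memory: facts are partitioned by workflow and state, and only the relevant slice is injected into the prompt. The in-process dict is a stand-in for an external store such as DynamoDB; the key scheme is an assumption.

```python
# Memory partitioned by (workflow_id, state) instead of one monolithic history.
MEMORY: dict[tuple[str, str], dict] = {}

def remember(workflow_id: str, state: str, facts: dict) -> None:
    """Persist facts under the workflow state they belong to."""
    MEMORY.setdefault((workflow_id, state), {}).update(facts)

def context_for(workflow_id: str, state: str) -> dict:
    """Inject only the slice relevant to the current state; nothing else."""
    return dict(MEMORY.get((workflow_id, state), {}))
```

Facts cached this way (e.g. a SKU already looked up during triage) do not need to be re-fetched by a tool call, which is where the token and latency savings come from.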
 

Closing

A production agent should not depend on the model reconstructing the workflow from chat history every time. The workflow state should be explicit, inspectable, and evaluable. That is the difference between a helpful assistant and a system you can operate.