The industry has long been obsessed with model performance, operating under the assumption that if an agent fails, the model simply isn't "smart" enough. However, production teams are increasingly building for a different reality: a high-performance model trapped in a "thin wrapper" architecture is almost guaranteed to fail when faced with real-world state changes, network glitches, or edge-case reasoning.
Production reliability is not solved with better prompting. It is a systems engineering problem.
Moving from an experimental demo to a dependable service requires rethinking what we mean by a Production Agent Stack. In mature systems, the stack resembles a workflow runtime more than a chatbot wrapper. Around the model sits a broader operational framework responsible for execution control, policy enforcement, validation, observability, recovery, auditability, and state management.
Inside the system, not every component should be an autonomous agent. Engineering teams should use autonomous agents where reasoning is required but rely on deterministic services where logic is fixed. This separation improves reliability, reduces operational unpredictability, and creates clearer system boundaries.

1. Runtime Owns State
In production, state cannot live only inside the model context window. The workflow runtime must own execution state through durable execution: state machines, DAGs, checkpoints, retries, resumes, and rollbacks.
The model may propose the next step, but the runtime decides what is persisted, retried, resumed, escalated, or allowed to execute. This is the boundary that turns an agent from a prompt loop into an operable system.
Frameworks such as LangGraph make this explicit by treating state as a first-class object rather than an implicit conversation history. These runtimes use explicit, reducer-driven state schemas to persist a durable ledger of facts and execution history, allowing workflows to recover from partial failures, retry tasks, resume execution, or revert state with full context—without restarting the entire workflow after a network interruption or tool failure.
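The idea of a runtime that owns state can be sketched without any framework. The example below is a minimal illustration using only the Python standard library, not LangGraph's API: `run_workflow`, the checkpoint file layout, and the `lookup` / `refund` steps are all hypothetical names invented for this sketch. It persists state after every completed step, so a second run resumes after the last good step instead of starting over.

```python
import json
import os
import tempfile
from typing import Callable

# Hypothetical step type: receives the known facts, returns new facts to merge in.
Step = Callable[[dict], dict]


def run_workflow(steps: list[tuple[str, Step]], checkpoint_path: str,
                 max_retries: int = 2) -> dict:
    """Run steps in order, checkpointing state after each completed step."""
    # Resume from the last checkpoint if one exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"completed": [], "facts": {}}

    for name, step in steps:
        if name in state["completed"]:
            continue  # finished in an earlier run; do not repeat it
        for attempt in range(max_retries + 1):
            try:
                updates = step(dict(state["facts"]))
                break
            except Exception:
                if attempt == max_retries:
                    raise  # the checkpoint on disk still holds the last good state
        state["facts"].update(updates)
        state["completed"].append(name)
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)
    return state


# Demo: the second step fails once, simulating a network interruption.
calls = {"lookup": 0, "refund": 0}


def lookup(facts: dict) -> dict:
    calls["lookup"] += 1
    return {"order_id": 42}


def refund(facts: dict) -> dict:
    calls["refund"] += 1
    if calls["refund"] == 1:
        raise ConnectionError("simulated network interruption")
    return {"refunded_order": facts["order_id"]}


steps = [("lookup", lookup), ("refund", refund)]
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

try:
    run_workflow(steps, path, max_retries=0)  # first run stops at "refund"
except ConnectionError:
    pass

final = run_workflow(steps, path, max_retries=0)  # second run resumes after "lookup"
```

In this sketch the model never appears: whatever proposes the steps, the runtime alone decides what is persisted and what is skipped on resume. A production runtime would add rollbacks, idempotency keys, and a real store in place of a JSON file.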
2. Planner Proposes
The planner proposes a route, plan, model choice, or tool call. It is not necessarily a model. In simple systems, the planner may be an LLM prompt that suggests the next action. In production systems, it is often a hybrid control component composed of LLM reasoning, routing logic, workflow orchestration, and policy-aware decision rules.
| System Complexity | Form of Planner | Core Objective | Notes |
| --- | --- | --- | --- |
| Simple systems | Single LLM prompt (e.g., ReAct-style prompting) | Generate the next action | Planner is primarily reactive and step-by-step |
| Collaborative / concurrent systems | Task decomposer | Produce a parallelizable task DAG | Focus on decomposition, dependency resolution, and tool orchestration |
| Enterprise / production systems | Hybrid control layer (controller) | Balance cost, risk, policy, latency, and performance | Combines LLM reasoning, routing logic, workflow orchestration, and governance constraints |
Model routing is a key part of this layer: not every request should be sent to the most expensive reasoning model. The planner evaluates task complexity, risk, and latency constraints before reasoning begins. Simple tasks are handled by smaller, faster models, while ambiguous, high-risk, or multi-step problems are escalated to more capable models or human review. This creates a controlled tradeoff between latency, cost, and quality.
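A routing rule of this kind can be as plain as a few threshold checks. The sketch below is one possible shape, under stated assumptions: the `TaskProfile` fields, the numeric thresholds, and the tier names (`small_model`, `large_model`, `human_review`) are hypothetical, and how complexity and risk get scored is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    complexity: float       # 0.0 (trivial) to 1.0 (open-ended); scoring is assumed
    risk: float             # 0.0 (read-only) to 1.0 (irreversible)
    latency_budget_ms: int  # how long the caller can wait


def route(task: TaskProfile) -> str:
    """Pick a hypothetical tier before any reasoning happens."""
    if task.risk >= 0.8:
        return "human_review"  # high-risk work is escalated, not automated
    if task.complexity >= 0.6 or task.risk >= 0.5:
        return "large_model"
    # Mid-complexity tasks use the larger model only when latency allows it.
    if task.complexity >= 0.3 and task.latency_budget_ms >= 2000:
        return "large_model"
    return "small_model"


simple = route(TaskProfile(complexity=0.1, risk=0.0, latency_budget_ms=300))
hard = route(TaskProfile(complexity=0.7, risk=0.2, latency_budget_ms=300))
risky = route(TaskProfile(complexity=0.4, risk=0.9, latency_budget_ms=5000))
```

Keeping the router deterministic makes the cost and latency tradeoff auditable: a changed threshold is a reviewable code change rather than a prompt tweak.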
Importantly, the planner does not own execution. Its output is only a proposal. This distinction is fundamental because LLM outputs are probabilistic and non-deterministic. A production system must never treat a generated plan as an authorized action. Every proposal must pass through validation, policy enforcement, and controlled execution layers before it is allowed to mutate system state.
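One way to make the proposal/authorization split hard to bypass is to give the two concepts different types. The following is a minimal sketch, assuming hypothetical names throughout (`Proposal`, `AuthorizedAction`, `authorize`, `execute`, and the two example gates): the executor accepts only an `AuthorizedAction`, and the only path to one runs through every gate.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Proposal:
    """What the planner emits: a suggestion, nothing more."""
    tool: str
    args: dict


@dataclass(frozen=True)
class AuthorizedAction:
    """A proposal that has passed every gate."""
    proposal: Proposal
    passed_gates: tuple


Gate = Callable[[Proposal], bool]


def authorize(proposal: Proposal, gates: dict[str, Gate]) -> Optional[AuthorizedAction]:
    """Return an AuthorizedAction only if all gates accept the proposal."""
    for gate in gates.values():
        if not gate(proposal):
            return None
    return AuthorizedAction(proposal, tuple(gates))


def execute(action: AuthorizedAction) -> str:
    if not isinstance(action, AuthorizedAction):
        raise TypeError("refusing to execute an unauthorized proposal")
    return f"executed {action.proposal.tool}"


# Hypothetical gates standing in for the validation and policy layers.
gates = {
    "schema": lambda p: isinstance(p.args.get("order_id"), int),
    "policy": lambda p: p.tool != "delete_data",
}

ok = authorize(Proposal("issue_refund", {"order_id": 42}), gates)
blocked = authorize(Proposal("delete_data", {"order_id": 42}), gates)
```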
3. Memory Provides Context
In a production system, memory is a cognitive aid, not a record of what has actually happened.
Some memory lives in the context window or KV cache (the inference-time attention cache) to provide immediate context; other memory lives in retrieval stores or long-term databases. Regardless of the storage mechanism, memory is used to condition reasoning.
One of the most frequent causes of agent drift is confusing Memory with State.
- Memory (Working, Long-term, Episodic): Helps the agent reason by providing context from previous interactions.
- State (Workflow, External Committed): Tells the system what has actually happened in the real world.
Memory helps the agent reason, but state tells the system what is true. Confusing the two leads to "hallucinated progress" where the agent believes a task is complete because it's in the context window, even if the API call failed.
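The difference is easy to see in a toy example. In the sketch below, `WorkflowState` and the step name `refund_order_42` are hypothetical; the point is only that completion is read from committed state, which the runtime writes after a confirmed tool result, and never inferred from text in the context window.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowState:
    """Committed facts, written only by the runtime after a confirmed tool result."""
    committed: dict = field(default_factory=dict)

    def record(self, step: str, ok: bool) -> None:
        self.committed[step] = ok

    def is_done(self, step: str) -> bool:
        return self.committed.get(step, False)


# Memory: what the context window contains, including an optimistic model message.
memory = [
    "Calling the refund API for order 42.",
    "Refund issued successfully.",
]

# State: what the runtime recorded when the API call actually failed.
state = WorkflowState()
state.record("refund_order_42", ok=False)

memory_says_done = any("Refund issued" in message for message in memory)
state_says_done = state.is_done("refund_order_42")
```

Here `memory_says_done` and `state_says_done` disagree, which is exactly the "hallucinated progress" case: the workflow should branch on the second value, not the first.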
4. Interop Delegates
There is a growing distinction between how an agent connects to a tool and whether it is authorized to use it. Protocols like the Model Context Protocol (MCP) or Agent-to-Agent (A2A) provide the standard "plumbing": they define how to call an API or share context, not whether a given call should be allowed.
Agent interoperability protocols such as A2A or ACP help agents delegate tasks and exchange context, but they do not replace workflow orchestration or policy enforcement.
5. Tools Expose Capabilities
Tool access is where the agent runtime connects to external systems: APIs, databases, browsers, internal services, and third-party platforms.
Protocols such as MCP, native APIs, and tool registries define what tools exist, what schemas they accept, and how responses are returned. This is the capability surface of the system.
But capability is not authority. Exposing a tool does not mean the agent should be allowed to use it in every context. A database query, refund API, shell command, or outbound message should still pass through validation, policy, and audit before it reaches execution.
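A small sketch can make "capability is not authority" concrete. Everything named here is an assumption for illustration: the `Tool` record, the stub handlers, the `REGISTRY` of what exists, and the `GRANTS` table of which calling context may use what. This is not the MCP API; it only shows the two lookups kept separate.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    handler: Callable[..., object]


def read_logs(service: str) -> str:
    return f"(stub) last log lines for {service}"


def issue_refund(order_id: int, amount: float) -> str:
    return f"(stub) refunded {amount:.2f} for order {order_id}"


# Capability surface: which tools exist.
REGISTRY = {t.name: t for t in (Tool("read_logs", read_logs),
                                Tool("issue_refund", issue_refund))}

# Authority: which calling context may use which tool (hypothetical contexts).
GRANTS = {
    "support_triage": {"read_logs"},
    "billing_ops": {"read_logs", "issue_refund"},
}


def call_tool(context: str, name: str, **kwargs) -> object:
    tool = REGISTRY[name]  # KeyError if the capability does not exist
    if name not in GRANTS.get(context, set()):
        raise PermissionError(f"{context!r} may not call {name!r}")
    return tool.handler(**kwargs)


allowed = call_tool("billing_ops", "issue_refund", order_id=42, amount=10.0)

try:
    call_tool("support_triage", "issue_refund", order_id=42, amount=10.0)
    denied = False
except PermissionError:
    denied = True
```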
6. Execution Mutates State
Tool execution should happen inside controlled environments, not directly on production systems. Modern stacks use sandboxing to enforce permission boundaries.
Some systems implement Transactional No-Regression (TNR) or rollback-style recovery to manage the risks of autonomous action. If an agent’s action fails to improve the system's health, a rollback workflow can return the environment to a known safe state before requesting human guidance.
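The rollback-style pattern can be sketched as snapshot, act, measure, and restore. This is a generic illustration and not an implementation of TNR itself; `apply_with_rollback`, the in-memory `env` dictionary, and the `health` metric are hypothetical stand-ins for a real environment and its health checks.

```python
import copy
from typing import Callable


def apply_with_rollback(env: dict, action: Callable[[dict], None],
                        health: Callable[[dict], float]) -> bool:
    """Apply an action; restore the snapshot unless health strictly improves."""
    snapshot = copy.deepcopy(env)
    baseline = health(env)
    try:
        action(env)
        improved = health(env) > baseline
    except Exception:
        improved = False
    if not improved:
        # Return to the known safe state; the caller can then ask a human.
        env.clear()
        env.update(snapshot)
    return improved


def health(e: dict) -> float:
    return -e["error_rate"]  # lower error rate means healthier


def bad_change(e: dict) -> None:
    e["replicas"] = 1
    e["error_rate"] = 0.20


def good_change(e: dict) -> None:
    e["error_rate"] = 0.01


env = {"replicas": 3, "error_rate": 0.05}

kept_bad = apply_with_rollback(env, bad_change, health)
env_after_bad = dict(env)  # should match the original snapshot

kept_good = apply_with_rollback(env, good_change, health)
```

Real systems cannot always deep-copy their environment, so the snapshot is usually a database transaction, a versioned config, or an infrastructure checkpoint. Actions with external side effects, such as a sent email, cannot be undone this way at all.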
7. Validation Gates Actions
A mature production agent should never operate without a form of “digital braking system.”
A validation layer checks preconditions and validates tool arguments before they enter the execution environment.
Format validation alone is not sufficient. This layer also enforces semantic and contextual correctness against live runtime state and policy constraints. For example, when a model proposes deleting a file, the validation layer must verify whether the target path is within an allowed execution scope and whether the operation complies with current runtime permissions and safety policies.
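The file-deletion check described above might look like the following sketch. The function name, the `fs.delete` permission string, and the example paths are assumptions; the path handling relies on `pathlib` from the standard library (`Path.is_relative_to` requires Python 3.9 or later).

```python
from pathlib import Path


def validate_delete(target: str, allowed_root: str,
                    permissions: set[str]) -> tuple[bool, str]:
    """Check a proposed delete against current permissions and the allowed scope."""
    if "fs.delete" not in permissions:
        return False, "delete is not permitted in this run"
    root = Path(allowed_root).resolve()
    path = Path(target).resolve()  # collapses ".." segments before comparing
    if path == root or not path.is_relative_to(root):
        return False, "target is outside the allowed scope"
    return True, "ok"


ROOT = "/srv/agent/workspace"  # hypothetical sandbox root

in_scope = validate_delete(f"{ROOT}/tmp/a.log", ROOT, {"fs.delete"})
escaped = validate_delete(f"{ROOT}/../secrets/key.pem", ROOT, {"fs.delete"})
no_permission = validate_delete(f"{ROOT}/tmp/a.log", ROOT, {"fs.read"})
```

A well-formed argument such as `../secrets/key.pem` passes format validation and still fails here, which is the gap between syntactic and contextual checks.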
Beyond serving as a security boundary, the validation layer also functions as a feedback mechanism for system improvement. Through RLVR (Reinforcement Learning with Verifiable Rewards), engineering teams can replace subjective human evaluation with deterministic signals derived from execution outcomes—such as whether a command executed successfully, whether a container built correctly, or whether a task completed within defined constraints.
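As a rough illustration of a verifiable signal, the sketch below turns execution outcomes into a binary reward. The check names and the all-or-nothing scoring are assumptions for this example; actual RLVR setups define their own verifiers and reward shaping.

```python
def verifiable_reward(checks: dict[str, bool]) -> float:
    """Return 1.0 only when every deterministic check passed, else 0.0."""
    if not checks:
        return 0.0  # nothing was verified, so nothing is rewarded
    return 1.0 if all(checks.values()) else 0.0


def failed_checks(checks: dict[str, bool]) -> list[str]:
    """Names of the checks that failed, useful as feedback for debugging."""
    return [name for name, passed in checks.items() if not passed]


# Hypothetical outcomes collected by the execution layer.
good_run = {"command_exit_zero": True, "container_built": True, "within_time_limit": True}
bad_run = {"command_exit_zero": True, "container_built": False, "within_time_limit": True}

good_reward = verifiable_reward(good_run)
bad_reward = verifiable_reward(bad_run)
bad_reasons = failed_checks(bad_run)
```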
8. Policy Cuts Across
Policy is not a single pre-check at the edge. It cuts across the runtime.
A production agent needs policy at multiple points: routing decisions, tool access, validation, execution, human approval, and rollback. The policy layer decides what the agent is allowed to do, under which risk budget, for which tenant, and with what level of human oversight.
This is especially important for irreversible or user-facing actions. A tool call that reads logs is different from a tool call that sends an email, issues a refund, deletes data, or changes production configuration.
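A policy decision that accounts for tenant, risk budget, and human oversight could be sketched as follows. The risk tiers, tenant names, and the rule that one tier above the budget requires approval are all hypothetical choices made for the example.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    REQUIRE_APPROVAL = "require_approval"
    DENY = "deny"


# Hypothetical risk tiers: reading is cheap, irreversible changes are not.
TOOL_RISK = {
    "read_logs": 1,
    "send_email": 2,
    "issue_refund": 3,
    "delete_data": 4,
    "change_prod_config": 4,
}

# Hypothetical per-tenant budget: the highest tier the agent may run unattended.
TENANT_RISK_BUDGET = {"internal-sandbox": 4, "enterprise-customer": 2}


def decide(tenant: str, tool: str) -> Decision:
    risk = TOOL_RISK.get(tool)
    if risk is None:
        return Decision.DENY  # unknown tools are denied by default
    budget = TENANT_RISK_BUDGET.get(tenant, 0)
    if risk <= budget:
        return Decision.ALLOW
    if risk == budget + 1:
        return Decision.REQUIRE_APPROVAL  # just over budget: a human decides
    return Decision.DENY


read = decide("enterprise-customer", "read_logs")
refund = decide("enterprise-customer", "issue_refund")
delete = decide("enterprise-customer", "delete_data")
```

The same `decide` function can be called at routing time, before tool access, and again just before execution, which is what it means for policy to cut across the runtime rather than sit at one edge.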
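Protocols such as MCP for tool access, and A2A / ACP for agent interoperability, expose capabilities and communication surfaces. They do not decide who may use those capabilities or under what conditions; that decision belongs to the policy layer.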
9. Observability & Evaluation
If you cannot replay the agent’s path, you cannot debug it, evaluate it, or govern it. A production stack requires integrated audit traces that are structurally coupled to the decision logic. This means logging every signal score, threshold crossed, and tool call made, preserved by the framework’s checkpointing mechanism for post-incident replay. Observability provides the raw substrate for evaluation, turning runtime behavior into analyzable signals.
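A trace that is coupled to the decision logic can be sketched by having the decision function itself emit the events. The `Trace` class, the JSON-lines format, and the `fraud_score` / `hold_payment` names below are hypothetical; a real system would write to durable storage through its checkpointing mechanism rather than an in-memory list.

```python
import json


class Trace:
    """Append-only event log that serializes to JSON lines."""

    def __init__(self) -> None:
        self.events: list[dict] = []

    def log(self, kind: str, **data) -> None:
        self.events.append({"seq": len(self.events), "kind": kind, **data})

    def dump(self) -> str:
        return "\n".join(json.dumps(e, sort_keys=True) for e in self.events)


def replay(jsonl: str) -> list[dict]:
    """Rebuild the event sequence from a stored trace."""
    return [json.loads(line) for line in jsonl.splitlines() if line]


def decide_and_act(trace: Trace, score: float, threshold: float = 0.7) -> str:
    # Every input to the decision is logged where the decision is made.
    trace.log("signal", name="fraud_score", score=score)
    if score >= threshold:
        trace.log("threshold_crossed", name="fraud_score", threshold=threshold)
        trace.log("tool_call", tool="hold_payment", args={"reason": "fraud_score"})
        return "hold_payment"
    trace.log("tool_call", tool="release_payment", args={})
    return "release_payment"


trace = Trace()
action = decide_and_act(trace, score=0.91)

replayed = replay(trace.dump())
path_taken = [event["kind"] for event in replayed]
```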
Evaluation is not an external benchmark applied at the edge of the system—it is a continuous feedback loop embedded throughout the stack.
In practice, production evaluation operates across three layers:
- Bottom Layer: Foundation model benchmarks (latency, quality, robustness).
- Middle Layer: Component-level performance (intent classification, routing accuracy, tool selection correctness).
- Upper Layer: Workflow trajectory evaluation (end-to-end goal completion, safety compliance, and failure recovery behavior).
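For the upper layer, a trajectory can be scored directly from recorded steps. The sketch below is one simple reading of the three criteria, with assumed names (`Step`, `evaluate_trajectory`) and an assumed definition of recovery: a failed step counts as recovered if a later call to the same tool succeeds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tool: str
    ok: bool
    policy_violation: bool = False


def evaluate_trajectory(steps: list[Step], goal_reached: bool) -> dict:
    """Score one end-to-end run on completion, safety, and failure recovery."""
    failures = [i for i, s in enumerate(steps) if not s.ok]
    recovered = sum(
        1 for i in failures
        if any(later.ok and later.tool == steps[i].tool for later in steps[i + 1:])
    )
    return {
        "goal_completed": goal_reached,
        "safety_compliant": not any(s.policy_violation for s in steps),
        "recovery_rate": recovered / len(failures) if failures else 1.0,
    }


run = [
    Step("search", ok=True),
    Step("fetch", ok=False),  # fails once...
    Step("fetch", ok=True),   # ...then succeeds on retry
    Step("write", ok=False),  # never recovered
]
report = evaluate_trajectory(run, goal_reached=False)
```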
Closing
The transition from experimental scripts to robust AI services is defined by the integrity of the system architecture. The goal of production AI engineering is not to build a model that never fails. It is to build a system that can recognize when its reasoning has diverged from the mission and has the architectural "brakes" to slow down, ask for help, or roll back to a safe state.
- Author: Fan Luo
- URL: https://fanluo.me/article/the-production-agent-stack
- Copyright: All articles in this blog adopt BY-NC-SA agreement. Please indicate the source!
