Fan’s Blog

AI research, systems, and engineering notes.

Featured

The Production Agent Stack

A reliable agent is not just an LLM connected to tools. A production agent stack is a system of layered responsibilities. The runtime owns execution state and governs workflow progression. The planner proposes next steps, but proposals are not execution. Memory provides contextual recall without serving as the source of truth. Agent interoperability enables structured delegation, while tools expose external capabilities through standardized protocols such as MCP. Validation transforms probabilistic model outputs into structured, policy-constrained proposals that can safely enter the execution pipeline. Execution itself occurs inside isolated runtime environments where side effects can be controlled, audited, recovered, or rolled back.
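As a rough illustration of the validation layer described above, the sketch below turns untrusted model output into a typed proposal or rejects it. The tool names, the `ALLOWED_TOOLS` policy table, and the `Proposal` type are hypothetical and chosen only for this example; they are not taken from the post or from any particular framework.

```python
from dataclasses import dataclass

# Hypothetical policy: each permitted tool and the argument keys it may receive.
ALLOWED_TOOLS = {
    "search_docs": {"query"},
    "send_email": {"to", "subject", "body"},
}


@dataclass(frozen=True)
class Proposal:
    """A validated request to run a tool. It is still a proposal, not an execution."""
    tool: str
    args: dict


def validate(raw: dict) -> Proposal:
    """Turn untrusted model output into a typed proposal, or raise ValueError."""
    tool = raw.get("tool")
    args = raw.get("args")
    if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
        raise ValueError(f"tool not allowed: {tool!r}")
    if not isinstance(args, dict):
        raise ValueError("args must be an object")
    unexpected = set(args) - ALLOWED_TOOLS[tool]
    if unexpected:
        raise ValueError(f"unexpected arguments: {sorted(unexpected)}")
    return Proposal(tool=tool, args=dict(args))


# A well-formed proposal passes; anything outside the policy is rejected.
ok = validate({"tool": "search_docs", "args": {"query": "refund policy"}})
```

Only a `Proposal` that survives this gate would be handed to the runtime, which keeps the decision to execute separate from the model's suggestion.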

May 26, 2026
Building Auditable LLM Workflows for Medical Coding
Medical coding is a high-stakes extraction and verification problem, not a simple text generation task. Asking an LLM to read a long clinical note and directly output ICD codes risks hallucinated mappings, missed comorbidities, and results that are difficult for human coders to audit. A reliable medical coding system should be built as an LLM-assisted workflow: extract clinical evidence, retrieve candidate codes, verify mappings, validate against the taxonomy, and route uncertainty to human review. The model should not be expected to memorize every code. Its job is to help produce auditable evidence inside a controlled workflow.
#Production AI #Applied AI #NLP #System Design
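The last two steps of the medical coding workflow above (validate against the taxonomy, route uncertainty to human review) might look roughly like the sketch below. The two-entry `TAXONOMY`, the `Candidate` fields, and the 0.8 threshold are assumptions made for illustration; a real system would load the full ICD-10-CM code set and calibrate its own thresholds.

```python
from dataclasses import dataclass

# Tiny illustrative taxonomy, not a complete code set.
TAXONOMY = {
    "E11.9": "Type 2 diabetes mellitus without complications",
    "I10": "Essential (primary) hypertension",
}


@dataclass
class Candidate:
    code: str       # proposed ICD-10 code
    evidence: str   # span quoted from the clinical note
    score: float    # hypothetical verifier confidence in [0, 1]


def triage(candidates, note, threshold=0.8):
    """Split candidates into auto-accepted codes and codes routed to human review."""
    accepted, review = [], []
    for c in candidates:
        grounded = bool(c.evidence) and c.evidence in note
        if c.code in TAXONOMY and grounded and c.score >= threshold:
            accepted.append(c)
        else:
            review.append(c)
    return accepted, review


note = "Patient has type 2 diabetes, well controlled. BP elevated today."
candidates = [
    Candidate("E11.9", "type 2 diabetes", 0.93),
    Candidate("I10", "BP elevated today", 0.55),   # low confidence
    Candidate("Z99.99", "well controlled", 0.95),  # not in the taxonomy
]
accepted, review = triage(candidates, note)
```

Every accepted code carries the evidence span that justified it, which is what gives a human coder something concrete to audit.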
May 10, 2026
World Models are becoming the simulation substrate for Agents
Agent world models are emerging as an important simulation layer between reasoning and execution. Early LLM agents followed a fragile loop: prompt, think, call a tool, wait for the result. In production, that loop is expensive and risky because the agent often cannot predict whether an action will move the workflow forward, fail silently, or mutate state in an unsafe way. A world model acts as a surrogate environment. Given a current state and candidate action, it predicts likely next states, failure modes, and observations. This allows agents to rank possible actions before touching real systems.
#AI Agent #Production AI #System Design
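To make the world-model idea above concrete, here is a minimal sketch of ranking candidate actions before touching a real system. The lookup table stands in for a learned model, and the action names, the `progress` and `failure_risk` fields, and the scoring rule are all assumptions made for this example.

```python
def toy_world_model(state: dict, action: str) -> dict:
    """Stand-in surrogate environment: a lookup table instead of a learned model."""
    table = {
        "read_ticket": {"progress": 0.3, "failure_risk": 0.0},
        "issue_refund": {"progress": 0.9, "failure_risk": 0.7},
        "ask_customer": {"progress": 0.5, "failure_risk": 0.1},
    }
    # Unknown actions are treated as certain failures.
    return table.get(action, {"progress": 0.0, "failure_risk": 1.0})


def rank_actions(state, actions, model, risk_weight=1.0):
    """Order actions by predicted progress minus weighted failure risk."""
    def score(action):
        prediction = model(state, action)
        return prediction["progress"] - risk_weight * prediction["failure_risk"]

    return sorted(actions, key=score, reverse=True)


ranked = rank_actions(
    {"ticket": "open"},
    ["read_ticket", "issue_refund", "ask_customer"],
    toy_world_model,
)
```

With these made-up numbers the risky refund ranks last, so the agent would try the cheaper, safer actions first.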
May 2, 2026
The Runtime Behind Production AI
A layered framework for scaling production AI systems begins with the SLA: latency, throughput, reliability, cost per resolved task, fallback behavior, and quality targets. Those requirements drive the architecture of the runtime, which spans the edge gateway, safety and governance, orchestration and routing, inference serving, compute scheduling, context and state management, model lifecycle operations, and observability.
#AI Agent #Production AI #AI Infra #System Design
Apr 21, 2026
Agent Observability Is Not Optional
Production agents are hard to operate because teams need to know not only whether they ran, but why they acted. Traditional observability tracks service health: latency, errors, throughput, CPU, and memory. Agent observability must go further. It must capture intent, workflow state, retrieved context, model and prompt versions, tool proposals, policy checks, approvals, state mutations, and final outcomes. Enterprise trust requires replay. A reliable agent system must be able to prove which decision, context bundle, and policy check allowed an autonomous action to happen.
#AI Agent #Production AI #System Design #AI Infra
Apr 6, 2026
State Is the Hard Part of Production Agents
As AI agents move from short-lived chat interactions to long-running autonomous systems, the hardest engineering problems are no longer about prompts or model quality. They are about state management, replay safety, memory hierarchy, checkpointing, and transactional execution. Production agents need a cache-aware, transactional runtime. Agent state should not be a probabilistic byproduct of a chat log; it should be a deterministic projection of validated events.
#AI Agent #AI Infra #Production AI
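The phrase "a deterministic projection of validated events" can be sketched as a pure fold over an event log. The event kinds and payload fields below are hypothetical; the point is only that the same set of events always produces the same state, however many times it is replayed.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Event:
    seq: int        # position in the log
    kind: str       # hypothetical event type
    payload: dict


def apply(state: dict, event: Event) -> dict:
    """Pure transition: returns a new state and never mutates the input."""
    if event.kind == "task_created":
        return {**state, "status": "open", "task": event.payload["task"]}
    if event.kind == "tool_succeeded":
        done = state.get("completed_steps", ()) + (event.payload["step"],)
        return {**state, "completed_steps": done}
    if event.kind == "task_closed":
        return {**state, "status": "closed"}
    return state  # unknown event kinds are ignored


def project(events) -> dict:
    """Rebuild state by replaying events in sequence order."""
    ordered = sorted(events, key=lambda e: e.seq)
    return reduce(apply, ordered, {})


log = [
    Event(1, "task_created", {"task": "reset password"}),
    Event(2, "tool_succeeded", {"step": "verify_identity"}),
    Event(3, "tool_succeeded", {"step": "send_reset_link"}),
    Event(4, "task_closed", {}),
]
state = project(log)
```

Because `project` sorts by `seq` and `apply` has no side effects, replaying the log after a crash or during an audit yields the same state as the original run.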
Mar 17, 2026
Production Agents Run on an Autonomy Spectrum
Production agents should not be designed around the fantasy of full autonomy. In real environments, agents face brittle interfaces, evolving user preferences, security gates, ambiguous state, and irreversible actions. The goal is not to remove humans entirely, but to build systems that know when autonomy is safe and when it should be scaled back. A reliable agent is not one that never needs help. It is one that knows when to slow down, ask for confirmation, or hand control back.
#AI Agent #System Design #Production AI
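One way to picture the autonomy spectrum above is a small policy that picks a mode per action. The three modes, the confidence input, and both thresholds are assumptions made for this sketch rather than values from the post.

```python
from enum import Enum


class Mode(Enum):
    AUTO = "auto"          # act without asking
    CONFIRM = "confirm"    # ask the user before acting
    HANDOFF = "handoff"    # return control to a human


def choose_mode(confidence: float, reversible: bool,
                auto_threshold: float = 0.9,
                confirm_threshold: float = 0.6) -> Mode:
    """Pick how much autonomy an action gets (thresholds are illustrative)."""
    if confidence < confirm_threshold:
        return Mode.HANDOFF
    # Irreversible actions always need confirmation, however confident the agent is.
    if not reversible or confidence < auto_threshold:
        return Mode.CONFIRM
    return Mode.AUTO
```

For example, `choose_mode(0.95, reversible=False)` returns `Mode.CONFIRM`: high confidence does not override the fact that the action cannot be undone.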
Mar 10, 2026
Agent Reliability Lives in the Runtime
In production, agent behavior is shaped by the runtime around the model: which tools are visible, when retrieval happens, how retries are handled, what state is persisted, and who is allowed to commit mutations. Reliable agents require more than better prompts or stronger models. They need runtime architecture. Framework defaults, tool visibility, retry policies, and context assembly rules can change behavior even when the underlying model stays the same.
#AI Agent #System Design #AI Infra #Production AI
Mar 2, 2026
Automating the Prompt Production Line
In production LLM systems, a prompt is no longer just a string written by a human. It is a deployable artifact. This post explains how automated prompt optimization actually works: build eval sets, collect optimization signals, generate candidates, and evaluate changes in stages. Prompts become versioned, testable artifacts with eval gates, canary rollouts, observability, and rollback.
#LLMOps #AI Infra #LLM #Production AI
Feb 3, 2026
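From Traditional Summarization to Semantic Synthesis
In the LLM era, summarization is no longer just "making long text shorter." It has evolved into information-density management within context engineering: compressing the KV cache at runtime, pruning low-value context at the protocol layer, and performing hierarchical, structured, and trajectory summarization at the application layer. Traditional summarization reduces volume; semantic synthesis restructures information, turning text into high-density semantic assets that are retrievable, verifiable, and executable.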
#AI #NLP #LLM #RAG
Feb 20, 2026
Design Agents Around Workflows, Not Chat Turns
Chat is a useful interface, but it becomes a weak system design primitive once agents are expected to complete real work. A reliable agent should advance a process, not merely generate text. That requires routing simple requests to deterministic paths, using retrieval when grounding is needed, reserving reasoning for ambiguous tasks, and separating planning from execution. For repeatable workflows, LLMs can generate structured plans while deterministic engines handle tool calls, retries, and state transitions. Production agents should be designed around explicit, inspectable, and evaluable workflow state, not around state reconstructed from chat history on every turn.
#AI Agent #Production AI #System Design
Feb 9, 2026
Routing Before Reasoning
Production agents should not send every request to the most expensive reasoning path. As reasoning models become more capable, they also introduce new production risks: higher latency, unpredictable cost, KV-cache pressure, and unnecessary “overthinking” for simple requests. Before invoking deep inference, tool use, or multi-step planning, a production agent should first decide which path is actually needed. Production agents are control systems. The real engineering value is not only in the model, but in the controller that decides when to reason, when to execute, and when to ask for human approval.
#AI Agent #Production AI #System Design
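The controller described above, which decides which path a request needs before any deep inference runs, could be sketched as a cheap rule-based router. The request fields, the path names, and the `DETERMINISTIC_INTENTS` set are hypothetical; a production router might use a small classifier instead of fixed rules.

```python
# Hypothetical intents that a fixed code path can answer without a model.
DETERMINISTIC_INTENTS = {"order_status", "reset_password"}


def route(request: dict) -> str:
    """Choose the cheapest path that can safely handle the request."""
    if request.get("mutates_state"):
        return "human_approval"   # risky side effects wait for a person
    if request.get("intent") in DETERMINISTIC_INTENTS:
        return "deterministic"    # no model call needed
    if request.get("needs_grounding"):
        return "retrieval"        # look up facts before answering
    return "reasoning"            # reserve the expensive path for the rest


paths = [
    route({"intent": "order_status"}),
    route({"intent": "policy_question", "needs_grounding": True}),
    route({"intent": "refund", "mutates_state": True}),
    route({"intent": "plan_migration"}),
]
```

The order of the checks is itself a design choice: safety checks come first, then the cheapest paths, so the reasoning model is reached only when nothing simpler applies.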
