AI research, systems, and engineering notes.
Building Auditable LLM Workflows for Medical Coding
Medical coding is a high-stakes extraction and verification problem, not a simple text generation task. Asking an LLM to read a long clinical note and directly output ICD codes risks hallucinated mappings, missed comorbidities, and results that are difficult for human coders to audit. A reliable medical coding system may benefit from an LLM-assisted workflow: extract clinical evidence, retrieve candidate codes, verify mappings, validate against the taxonomy, and route uncertainty to human review. The model should not be expected to memorize every code. Its job is to help produce auditable evidence inside a controlled workflow.
- Jun 6, 2026
Production Agents Need Workflow Graphs
The production abstraction for AI is not a “smarter agent loop.” It is a stateful execution graph: a system organized around explicit state transitions, dependency-aware scheduling, versioned plans, deterministic control boundaries, and carefully managed side effects. Intelligence alone is insufficient. Production reliability emerges from how execution, authority, recovery, and state mutation are structured across the graph.
#AI Agent#Production AI#System Design - May 2, 2026
The Runtime Behind Production AI
A layered framework for scaling production AI systems begins with the SLA: latency, throughput, reliability, cost per resolved task, fallback behavior, and quality targets. Those requirements drive the architecture of the runtime — spanning the edge gateway, safety and governance, orchestration and routing, inference serving, compute scheduling, context and state management, model lifecycle operations, and observability.
#AI Agent#Production AI#AI Infra#System Design - Apr 6, 2026
State Is the Hard Part of Production Agents
As AI agents move from short-lived chat interactions to long-running autonomous systems, the hardest engineering problems are no longer about prompts or model quality. They are about state management, replay safety, memory hierarchy, checkpointing, and transactional execution. Production agents need a cache-aware, transactional runtime. Agent state should not be a probabilistic byproduct of a chat log; it should be a deterministic projection of validated events.
#AI Agent#AI Infra#Production AI - Mar 2, 2026
Automating the Prompt Production Line
In production LLM systems, a prompt is no longer just a string written by a human. It is a deployable artifact. This post explains how automated prompt optimization actually works: build eval sets, collect optimization signals, generate candidates, and evaluate changes in stages. Prompts become versioned, testable artifacts with eval gates, canary rollouts, observability, and rollback.
#LLMOps#AI Infra#LLM#Production AI - Feb 20, 2026
Design Agents Around Workflows, Not Chat Turns
Chat is a useful interface, but it becomes a weak system design primitive once agents are expected to complete real work. A reliable agent should advance a process, not merely generate text. That requires routing simple requests to deterministic paths, using retrieval when grounding is needed, reserving reasoning for ambiguous tasks, and separating planning from execution. For repeatable workflows, LLMs can generate structured plans while deterministic engines handle tool calls, retries, and state transitions. Production agents should be designed around explicit, inspectable, and evaluable workflow state, not around state reconstructed from chat history every time.
#AI Agent#Production AI#System Design - Feb 9, 2026
Routing Before Reasoning
Production agents should not send every request to the most expensive reasoning path. As reasoning models become more capable, they also introduce new production risks: higher latency, unpredictable cost, KV-cache pressure, and unnecessary “overthinking” for simple requests. Before invoking deep inference, tool use, or multi-step planning, a production agent should first decide which path is actually needed. Production agents are control systems. The real engineering value is not only in the model, but in the controller that decides when to reason, when to execute, and when to ask for human approval.
#AI Agent#Production AI#System Design - Feb 3, 2026
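From Traditional Summarization to Semantic Synthesis
In the paradigm driven by large language models (LLMs), "summarization" is no longer just a short-text generation task for human readers. It is evolving into a machine-to-machine (M2M) semantic synthesis operator. Its core is not merely compressing text length, but establishing a compilation mechanism from unstructured text to a structured intermediate representation (IR), turning raw material into high-density semantic assets that are consumable, retrievable, traceable, verifiable, and executable. To put this synthesis pipeline into practice, the system must rely on context engineering for full-lifecycle governance: deciding which information may enter, which information must be retained, how to compress, organize, and present it, and how to evaluate its quality.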
#AI#NLP#LLM#RAG - Jan 25, 2026
The Production Agent Stack
A reliable agent is not just an LLM connected to tools. A production agent stack is a system of layered responsibilities. The runtime owns execution state and governs workflow progression. The planner proposes next steps, but proposals are not execution. Memory provides contextual recall without serving as the source of truth. Agent interoperability enables structured delegation, while tools expose external capabilities through standardized protocols such as MCP. Validation transforms probabilistic model outputs into structured, policy-constrained proposals that can safely enter the execution pipeline. Execution itself occurs inside isolated runtime environments where side effects can be controlled, audited, recovered, or rolled back.
#AI Agent#Production AI#System Design - Jan 16, 2026
Building a Simple Agent from Scratch
This post walks through the implementation of a minimal invoice-processing agent. The agent parses an invoice, verifies it against a ledger, requests approval when needed, and writes the final entry only after validation. The core pattern is simple: state constrains actions, the planner proposes one, validation gates it, tools return observations, the reducer updates state, and the runtime decides whether to stop. Before adopting complex orchestration frameworks, build this loop first.
#AI Agent#Applied AI#System Design - Dec 16, 2025
Search Is Becoming Agent Infrastructure
Search is no longer just a user-facing answer interface. In production agent systems, it is becoming the context acquisition layer of the agent runtime. Traditional search returned ranked documents and left the user to interpret results. Early RAG systems followed a similar pattern: retrieve evidence, inject it into the prompt, and generate a response. But agents use search differently. They invoke search as an internal workflow step to clarify intent, retrieve evidence, choose tools, verify state, inspect logs, and recover from failures.
#AI Agent#AI Infra#Search#Production AI - Nov 19, 2025
Demystifying Agentic Search Engines
With agentic search engines such as Google AI Mode, Perplexity, Bing Copilot, and ChatGPT Search, search no longer means “type keywords, get ten blue links.” These AI search experiences can understand tasks, plan queries, call tools, and synthesize results, then deliver a conversational response with inline citations, minimizing user effort. In this post, I’ll walk through the stack from bottom to top: how it crawls and indexes pages, how it retrieves and ranks information, and how recent features like RAG and agentic search build upon these foundations.
#System Design#RAG#Retrieval#LLM