This post builds a minimal LLM extraction pipeline from scratch using ICD code extraction as the running example. The pipeline includes data I/O, a thin LLM wrapper, prompt construction, JSON parsing, schema validation, retry logic, batch execution, and evaluation. The goal is to show the foundational utility layer that makes LLM applications testable, debuggable, provider-agnostic, and production-ready.

Building a Minimal LLM Pipeline

The production abstraction for AI is not a “smarter agent loop.” It is a stateful execution graph: a system organized around explicit state transitions, dependency-aware scheduling, versioned plans, deterministic control boundaries, and carefully managed side effects.
Intelligence alone is insufficient. Production reliability emerges from how execution, authority, recovery, and state mutation are structured across the graph.

Production Agents Need Workflow Graphs

Medical coding is a high-stakes extraction and verification problem, not a simple text generation task. Asking an LLM to read a long clinical note and directly output ICD codes risks hallucinated mappings, missed comorbidities, and results that are difficult for human coders to audit. A reliable medical coding system may benefit from an LLM-assisted workflow: extract clinical evidence, retrieve candidate codes, verify mappings, validate against the taxonomy, and route uncertainty to human review. The model should not be expected to memorize every code. Its job is to help produce auditable evidence inside a controlled workflow.

Building Auditable LLM Workflows for Medical Coding

A layered framework for scaling production AI systems begins with the SLA: latency, throughput, reliability, cost per resolved task, fallback behavior, and quality targets. Those requirements drive the architecture of the runtime — spanning the edge gateway, safety and governance, orchestration and routing, inference serving, compute scheduling, context and state management, model lifecycle operations, and observability.

The Runtime Behind Production AI

As AI agents move from short-lived chat interactions to long-running autonomous systems, the hardest engineering problems are no longer about prompts or model quality. They are about state management, replay safety, memory hierarchy, checkpointing, and transactional execution. Production agents need a cache-aware, transactional runtime. Agent state should not be a probabilistic byproduct of a chat log; it should be a deterministic projection of validated events.

State Is the Hard Part of Production Agents

In production LLM systems, a prompt is no longer just a string written by a human. It is a deployable artifact.
This post explains how automated prompt optimization actually works: build eval sets, collect optimization signals, generate candidates, and evaluate changes in stages. Prompts become versioned, testable artifacts with eval gates, canary rollouts, observability, and rollback.

Automating the Prompt Production Line

Chat is a useful interface, but it becomes a weak system design primitive once agents are expected to complete real work. A reliable agent should advance a process, not merely generate text. That requires routing simple requests to deterministic paths, using retrieval when grounding is needed, reserving reasoning for ambiguous tasks, and separating planning from execution. For repeatable workflows, LLMs can generate structured plans while deterministic engines handle tool calls, retries, and state transitions. Production agents should be designed around explicit, inspectable, and evaluable workflow state—not reconstructed from chat history every time.

Design Agents Around Workflows, Not Chat Turns

Production agents should not send every request to the most expensive reasoning path. As reasoning models become more capable, they also introduce new production risks: higher latency, unpredictable cost, KV-cache pressure, and unnecessary “overthinking” for simple requests. Before invoking deep inference, tool use, or multi-step planning, a production agent should first decide which path is actually needed. Production agents are control systems. The real engineering value is not only in the model, but in the controller that decides when to reason, when to execute, and when to ask for human approval.

Routing Before Reasoning

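In the paradigm driven by large language models (LLMs), "summarization" is no longer just a short-text generation task for human readers; it is gradually evolving into a machine-to-machine (M2M) semantic synthesis operator. Its core is not only compressing text length but establishing a compilation mechanism from unstructured text to a structured intermediate representation (IR), turning raw material into high-density semantic assets that are consumable, retrievable, traceable, verifiable, and executable. To put this synthesis pipeline into practice, the system must rely on context engineering for full-lifecycle governance: deciding which information may enter, which must be retained, how to compress, organize, and present it, and how to evaluate its quality.

From Traditional Summarization to Semantic Synthesis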

A reliable agent is not just an LLM connected to tools. A production agent stack is a system of layered responsibilities.
The runtime owns execution state and governs workflow progression. The planner proposes next steps, but proposals are not execution. Memory provides contextual recall without serving as the source of truth. Agent interoperability enables structured delegation, while tools expose external capabilities through standardized protocols such as MCP.
Validation transforms probabilistic model outputs into structured, policy-constrained proposals that can safely enter the execution pipeline. Execution itself occurs inside isolated runtime environments where side effects can be controlled, audited, recovered, or rolled back.

The Production Agent Stack

This post walks through the implementation of a minimal invoice-processing agent. The agent parses an invoice, verifies it against a ledger, requests approval when needed, and writes the final entry only after validation. The core pattern is simple: state constrains actions, the planner proposes one, validation gates it, tools return observations, the reducer updates state, and the runtime decides whether to stop. Before adopting complex orchestration frameworks, build this loop first.

Building a Simple Invoice-processing Agent
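
The loop described in the invoice-processing entry above can be sketched as follows. This is a minimal illustration under stated assumptions, not the post's implementation: the `State` fields, the `LEDGER` contents, the `APPROVAL_LIMIT` threshold, the `"id, amount"` input format, and the rule-based planner standing in for an LLM are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional

# Hypothetical ledger and approval threshold, for illustration only.
LEDGER = {"INV-1": 120.0}
APPROVAL_LIMIT = 100.0


@dataclass(frozen=True)
class State:
    raw: str
    invoice: Optional[dict] = None
    verified: bool = False
    failed: bool = False
    approved: bool = False
    written: bool = False


def allowed_actions(state: State) -> set:
    """State constrains which actions are legal next."""
    if state.failed or state.written:
        return set()
    if state.invoice is None:
        return {"parse"}
    if not state.verified:
        return {"verify"}
    if state.invoice["amount"] > APPROVAL_LIMIT and not state.approved:
        return {"request_approval"}
    return {"write"}


def rule_planner(state: State) -> str:
    """Stand-in for an LLM planner: proposes exactly one action."""
    return sorted(allowed_actions(state))[0]


def run_tool(action: str, state: State) -> dict:
    """Tools return observations; they never mutate state directly."""
    if action == "parse":
        invoice_id, amount = state.raw.split(",")
        return {"invoice": {"id": invoice_id.strip(), "amount": float(amount)}}
    if action == "verify":
        ok = LEDGER.get(state.invoice["id"]) == state.invoice["amount"]
        return {"verified": ok, "failed": not ok}
    if action == "request_approval":
        return {"approved": True}  # a real system would wait for a human here
    if action == "write":
        return {"written": True}
    raise ValueError(f"unknown action: {action}")


def reduce_state(state: State, observation: dict) -> State:
    """The reducer is the only place where state changes."""
    return replace(state, **observation)


def run(raw: str, planner: Callable[[State], str] = rule_planner,
        max_steps: int = 10) -> State:
    state = State(raw=raw)
    for _ in range(max_steps):
        legal = allowed_actions(state)
        if not legal:  # the runtime decides when to stop
            break
        action = planner(state)
        if action not in legal:  # validation gates the proposal
            raise ValueError(f"rejected proposal: {action}")
        state = reduce_state(state, run_tool(action, state))
    return state
```

Calling `run("INV-1, 120.0")` walks through parse, verify, approval, and write, while an invoice that does not match the ledger stops after verification without writing anything. Passing a planner that proposes `"write"` first shows the validation gate rejecting an out-of-order action.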

Search is no longer just a user-facing answer interface. In production agent systems, it is becoming the context acquisition layer of the agent runtime.
Traditional search returned ranked documents and left the user to interpret results. Early RAG systems followed a similar pattern: retrieve evidence, inject it into the prompt, and generate a response. But agents use search differently. They invoke search as an internal workflow step to clarify intent, retrieve evidence, choose tools, verify state, inspect logs, and recover from failures.

Search Is Becoming Agent Infrastructure

With agentic search engines such as Google AI Mode, Perplexity, Bing Copilot, and ChatGPT Search, search no longer means “type keywords, get ten blue links.” An AI search experience can understand tasks, plan queries, call tools, synthesize results, and deliver a conversational response with inline citations, minimizing user effort. In this post, I’ll walk through the stack from bottom to top: how it crawls and indexes pages, how it retrieves and ranks information, and how recent features like RAG and agentic search build on these foundations.

Demystifying Agentic Search Engines

Building Modern Recommendation Systems introduces a comprehensive, end-to-end pipeline that drives intelligent recommendations. The post walks through the full machine learning workflow — from raw data preparation and feature engineering to model training, deployment, real-time inference, and system monitoring. 

Modern Recommendation System Infrastructure

This post explores the full RecSys architecture, emphasizing the core models that drive each stage of the RecSys pipeline — from Retrieval for large-scale candidate generation, to Pre-ranking for efficient filtering, Ranking for fine-grained relevance modeling, and Re-ranking for balancing diversity and control.

Design a Modern Recommendation System

Building production ML systems is far more than selecting a model. Success requires thinking in terms of a full lifecycle: defining precise functional and non-functional requirements, designing robust data pipelines, splitting logic between models and rules, versioning and deploying models, prompts, and embeddings as coherent units, and continuously monitoring system performance and product impact.

The ML Factory: Building Production ML Systems

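This post systematically reviews the development of deep learning, from basic neural networks to the emergence of attention and the Transformer, then to the rise of deep generative models, and finally to the trend toward multimodal and unified modeling architectures, presenting today's mainstream model families.

The Evolution of Deep Learning Model Architectures

This post briefly summarizes the evolution of deep learning across four major areas: NLP, computer vision, information retrieval, and recommendation systems. It moves from early specialized models such as RNNs and CNNs, to the broad dominance of the Transformer, and on to today's pretrained large models such as BERT/GPT, ViT, and Diffusion sweeping across fields. The core trend is that the pretraining-plus-generative paradigm is replacing traditional task-specific models, while unified modeling and generative architectures accelerate cross-domain convergence and a new round of innovation.

Deep Learning Models Across Domains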

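Hypothesis Testing in Engineering Experiments

Key LLM Technologies: From Fundamentals to Deployment

Machine Learning Models: From Traditional Algorithms to Generative AI

The End-to-End ML Model Production Workflow

NLP Techniques and Applications: From Language Understanding to Intelligent Generation

Model Training: Methods and Practice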

Text Similarity and Retrieval Basics

Essential Evaluation Metrics for Applied ML Systems

Retrieval-Augmented Generation (RAG) combines large language models with external knowledge retrieval to produce more accurate and grounded responses. The post explains why RAG was introduced, explores its key use cases and real-world applications, and discusses challenges and considerations that impact performance in practical deployments.

Retrieval-Augmented Generation (RAG)

Essential Loss Functions for Machine Learning

Statistical Tests by Data Type

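This post explores how machine learning drives natural communication between humans and machines, from early dialogue systems to today's intelligent assistants that understand intent and carry out tasks. The recent trend is toward LLM + agent dialogue systems, where LLMs can be embedded in each core module of the architecture to strengthen understanding, generation, and decision-making. Ultimately, introducing agent mechanisms moves dialogue systems from “able to talk” to “able to act.”

Dialogue Systems: From Human-Machine Communication to Understanding and Interaction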

Complex Experimentation Beyond Standard A/B Testing

Causal Inference Beyond Randomized A/B Tests

Building the Release Gate: A/B Testing Framework Design

K-Means Clustering

Logistic Regression from Scratch

Linear Regression from Scratch

Binary search is an efficient searching technique based on the divide-and-conquer principle. By repeatedly narrowing the search space, it guarantees a worst-case time complexity of O(log n). It is well-suited for sorted data, monotonic arrays, and optimization problems where the goal is to find the best value. Common use cases include exact matching, boundary and insertion point searches, finding the closest element, and performing binary search on the answer space. 

Shrinking the Search Space with Binary Search
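
As a small illustration of the boundary and insertion-point search mentioned in the binary search entry above, here is a minimal sketch. The function name `lower_bound` is an assumption chosen for this example; Python's standard `bisect.bisect_left` provides the same behavior.

```python
from typing import Sequence


def lower_bound(values: Sequence[int], target: int) -> int:
    """Return the first index i with values[i] >= target.

    `values` must be sorted in ascending order. If every element is
    smaller than `target`, the result is len(values), which is also the
    position where `target` could be inserted to keep the order.
    """
    lo, hi = 0, len(values)  # search the half-open range [lo, hi)
    while lo < hi:
        mid = (lo + hi) // 2
        if values[mid] < target:
            lo = mid + 1  # the answer lies strictly right of mid
        else:
            hi = mid  # mid may be the answer, keep it in range
    return lo
```

Each iteration halves the range, which is where the O(log n) bound comes from. An exact-match search follows by checking whether the returned index is in range and holds `target`.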

The two-pointer technique is essential for optimizing operations on arrays, strings, and lists, often reducing time complexity from O(n²) to O(n). Common patterns include opposing pointers, sliding windows, fast–slow pointers, and dual-input pointers—each suited to different problem types such as finding pairs, subarrays, or merging sorted lists. 

Solving Problems with the Two-Pointers Technique
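
The opposing-pointers pattern named in the two-pointers entry above can be sketched with a pair-sum search on a sorted array. The function name `pair_with_sum` and the choice to return indices are assumptions made for this example.

```python
from typing import Optional, Sequence, Tuple


def pair_with_sum(sorted_values: Sequence[int],
                  target: int) -> Optional[Tuple[int, int]]:
    """Return indices (i, j), i < j, with sorted_values[i] + sorted_values[j] == target.

    Requires ascending order. Runs in O(n) because each step moves one
    pointer inward and never moves it back.
    """
    left, right = 0, len(sorted_values) - 1
    while left < right:
        total = sorted_values[left] + sorted_values[right]
        if total == target:
            return left, right
        if total < target:
            left += 1  # need a larger sum, so advance the small end
        else:
            right -= 1  # need a smaller sum, so retreat the large end
    return None
```

A brute-force check of all pairs would take O(n²); the sorted order is what lets each comparison discard one candidate for good.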

Topological Sorting Explained: Sorting Dependency Chains

Prefix search is a fundamental operation in computer science, typically implemented using a Trie (prefix tree). A Trie is a dynamic data structure for storing a collection of strings, supporting efficient insertion, lookup, and enumeration operations. Tries and their variants provide a powerful and efficient way to manage and query large volumes of string data.

Implementing Efficient Prefix Search with Tries
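
The insertion, lookup, and enumeration operations listed in the Trie entry above can be sketched as follows. The class and method names (`Trie`, `insert`, `contains`, `with_prefix`) are assumptions for illustration, and the dictionary-based node is one of several possible layouts.

```python
from typing import Dict, List, Optional


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False  # True if a stored string ends at this node


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def _find(self, prefix: str) -> Optional[TrieNode]:
        """Walk down the tree along `prefix`; None if the path breaks."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word: str) -> bool:
        node = self._find(word)
        return node is not None and node.is_word

    def with_prefix(self, prefix: str) -> List[str]:
        """Enumerate all stored words that start with `prefix`, sorted."""
        start = self._find(prefix)
        if start is None:
            return []
        found = []
        stack = [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.is_word:
                found.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return sorted(found)
```

Insertion and lookup cost time proportional to the length of the string, independent of how many strings are stored.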

Depth-First Search: Exploring Deep Before Wide

Breadth-First Search: Level-Order Exploration

Recursion is a core computational concept in which a function solves a problem by calling itself on smaller instances of that problem. Recursion is key to many algorithms: DFS (Depth-First Search) is often implemented recursively, top-down Dynamic Programming is recursion with caching (memoization), and Divide & Conquer uses recursion to split problems into independent subproblems.

Understanding Recursion: Functions That Call Themselves
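
To make the "recursion with caching" idea from the recursion entry above concrete, here is a minimal sketch using Fibonacci numbers. The example problem is an assumption chosen for brevity; `functools.lru_cache` is the standard-library memoization decorator.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib(n: int) -> int:
    """Return the n-th Fibonacci number, with fib(0) == 0 and fib(1) == 1."""
    if n < 2:  # base cases stop the recursion
        return n
    # Each call reduces the problem to two smaller instances; the cache
    # ensures every distinct n is computed only once.
    return fib(n - 1) + fib(n - 2)
```

Without the cache, the same subproblems are recomputed exponentially often; with it, the number of distinct computations grows linearly in `n`.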

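Find the length of the longest substring without repeating characters.

[Leetcode 3] Longest Substring Without Repeating Characters

Find and return the median of the two ascending sorted arrays.

[Leetcode 4] Median of Two Sorted Arrays

Find the longest palindromic substring in string s.

[Leetcode 5] Longest Palindromic Substring

Find two lines in the array height that, together with the x-axis, form a container holding the most water.

[Leetcode 11] Container With Most Water

Find the longest common prefix in an array of strings.

[Leetcode 14] Longest Common Prefix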

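[Leetcode 17] Letter Combinations of a Phone Number

Return all unique quadruplets [nums[a], nums[b], nums[c], nums[d]] that satisfy the conditions.

[Leetcode 18] 4Sum

[Leetcode 19] Remove Nth Node From End of List

[Leetcode 20] Valid Parentheses

Merge two ascending linked lists into one new ascending linked list and return it.

[Leetcode 21] Merge Two Sorted Lists

Merge all the ascending linked lists into one ascending linked list.

[Leetcode 23] Merge k Sorted Lists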

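Reverse the nodes of the list k at a time and return the modified list.

[Leetcode 25] Reverse Nodes in k-Group

Find the length of the longest valid (well-formed and contiguous) parentheses substring.

[Leetcode 32] Longest Valid Parentheses

Find the index of target in a rotated sorted array.

[Leetcode 33] Search in Rotated Sorted Array

Find the starting and ending positions of a given target value in a sorted array.

[Leetcode 34] Find First and Last Position of Element in Sorted Array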

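Given an array candidates of distinct positive integers, where each number may be reused, return all unique combinations that sum to target.

[Leetcode 39] Combination Sum

Given n bars, compute how much rainwater can be trapped after it rains.

[Leetcode 42] Trapping Rain Water

Match a string (s) against a pattern (p) with support for '?' and '*'.

[Leetcode 44] Wildcard Matching

Given an array nums of distinct numbers, return all possible permutations.

[Leetcode 46] Permutations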

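Given an array nums where nums[i] is the maximum number of steps you can jump to the right from position i, determine whether you can reach index n-1 starting from index 0.

[Leetcode 55] Jump Game

Count the distinct paths a robot at the top-left corner of a grid can take to reach the bottom-right corner.

[Leetcode 62] Unique Paths

Find the minimum number of operations needed to convert word1 into word2.

[Leetcode 72] Edit Distance

[Leetcode 79] Word Search

Remove all duplicate elements from a sorted linked list.

[Leetcode 83] Remove Duplicates from Sorted List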

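Find the area of the largest rectangle that can be outlined in a histogram.

[Leetcode 84] Largest Rectangle in Histogram

Merge two integer arrays sorted in non-decreasing order.

[Leetcode 88] Merge Sorted Array

Reverse the linked-list nodes from position left to position right.

[Leetcode 92] Reverse Linked List II

Given the root of a binary tree, check whether it is symmetric about its center axis.

[Leetcode 101] Symmetric Tree

Visit all nodes of a binary tree level by level, from left to right.

[Leetcode 102] Binary Tree Level Order Traversal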

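Given the preorder and inorder arrays of a binary tree, construct the tree and return its root.

[Leetcode 105] Construct Binary Tree from Preorder and Inorder Traversal

Populate every node of a perfect binary tree with a next pointer to the node on its right.

[Leetcode 116] Populating Next Right Pointers in Each Node

[Leetcode 121] Best Time to Buy and Sell Stock

[Leetcode 122] Best Time to Buy and Sell Stock II

[Leetcode 123] Best Time to Buy and Sell Stock III

Find the maximum sum of a path between any two nodes in a binary tree; the path need not pass through the root.

[Leetcode 124] Binary Tree Maximum Path Sum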

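Find all shortest transformation sequences from beginWord to endWord.

[Leetcode 126] Word Ladder II

Find the length of the longest sequence of consecutive integers in an array.

[Leetcode 128] Longest Consecutive Sequence

[Leetcode 138] Copy List with Random Pointer

Given a string s and a dictionary wordDict, return all sentences that split s into dictionary words.

[Leetcode 140] Word Break II

Relink the linked list in the order L0 → Ln → L1 → Ln-1 → ...

[Leetcode 143] Reorder List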

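Design and implement an LRU cache in which get and put must run in O(1) time.

[Leetcode 146] LRU Cache

Given the head of a linked list, sort it in ascending order and return the sorted list.

[Leetcode 148] Sort List

Given an integer array nums, return the product of the non-empty contiguous subarray with the largest product.

[Leetcode 152] Maximum Product Subarray

Find the node at which two linked lists begin to intersect.

[Leetcode 160] Intersection of Two Linked Lists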

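Convert an Excel column title to its corresponding column number.

[Leetcode 171] Excel Sheet Column Number

Rearrange the numbers in the array to form the largest possible integer.

[Leetcode 179] Largest Number

Count the islands surrounded by water (0) in a grid.

[Leetcode 200] Number of Islands

Given a non-negative integer n, return the number of primes strictly less than n.

[Leetcode 204] Count Primes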

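[Leetcode 206] Reverse Linked List

Implement a Trie class: initialization, string insertion, search, and prefix search.

[Leetcode 208] Implement Trie

Given a string expression containing only non-negative integers, spaces, and + - * /, evaluate it.

[Leetcode 227] Basic Calculator II

Given a binary tree, find the lowest common ancestor of two given nodes in the tree.

[Leetcode 236] Lowest Common Ancestor of a Binary Tree

A sliding window of size k moves from the far left of the array to the far right; return the maximum value in each window.

[Leetcode 239] Sliding Window Maximum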

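Search for a target value in an m x n sorted matrix.

[Leetcode 240] Search a 2D Matrix II

Given a 2D array vec, implement an iterator that supports the next and hasNext operations.

[Leetcode 251] Flatten 2D Vector

[Leetcode 300] Longest Increasing Subsequence

Given an integer array nums that may contain duplicates, output, uniformly at random, an index whose value equals target.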