Building robust, production-grade systems using Large Language Models requires moving past the script-based prototype phase. Before adopting heavy orchestration frameworks like LangChain or DSPy, engineering teams need first establish a stable, thin utility layer.
This utility layer transforms unstable, non-deterministic API interactions into predictable software components. It systematically handles message formatting, reliability (retries), data integrity (parsing/validation), and evaluation. Without it, business logic, prompt construction, model provider calls, JSON parsing, retry behavior, and evaluation logic quickly become tangled. The result is a codebase where switching models, debugging malformed outputs, or measuring prompt changes requires rewriting application logic.
A minimal LLM pipeline separates message construction, provider execution, output parsing, schema validation, retry behavior, batch execution, and evaluation into small modules that can be tested independently. The goal is to build a project structure that remains readable, debuggable, and replaceable.
1. Why Raw LLM Calls Are Not Enough
When prototyping, calling a provider API directly inside business logic is convenient:
However, raw API calls entangle your core application logic with the specific network signatures of third-party providers. They lack built-in mechanisms for exponential backoff, they assume the model will consistently return well-formed data, and they make A/B testing different models practically impossible without rewriting business logic.
Suppose your task function directly contains the model call, the prompt template, the retry logic, the JSON parser, and the schema validation. At first, this feels fast. But once the system grows, every future change becomes expensive:
- switching from one model provider to another requires editing business functions;
- adding retries risks retrying deterministic failures;
- prompt changes cannot be evaluated consistently;
- malformed outputs are hard to debug because raw responses are not captured at the right boundary;
- batch jobs cannot easily resume or isolate failures.
To build production-grade pipelines, we must decouple the intent of the application from the mechanics of the network request. Business logic should not know whether the underlying provider call is OpenAI-style
responses or chat.completions, Anthropic messages, a LangChain invoke(), or a self-hosted model endpoint.The stable abstraction is not the provider API. The stable abstraction is the contract your application owns.
2. Abstraction Levels: API, Framework, Project Wrapper
Before writing code, it is useful to distinguish three levels of LLM invocation. Mixing them leads to brittle architecture.
Provider / Ecosystem | Typical Call Style | Notes |
OpenAI | client.chat.completions.create(...)
client.responses.create(...) | chat.completions.create(...) remains widely supported across OpenAI-compatible providers (e.g., Kimi, DeepSeek, and Qwen).
For new OpenAI projects, responses.create(...) is the recommended unified API and is where new features are introduced first. |
Anthropic | client.messages.create(...) | Claude's native Messages API. |
Amazon Bedrock | client.converse(...) | The recommended API for conversational applications on Bedrock. Streaming is supported via ConverseStream. |
LangChain | model.invoke(...) | Framework-level abstraction that provides a unified interface across model providers. |
Hugging Face | model.generate(...) / pipeline(...) | Common interfaces for local inference and the Hugging Face ecosystem. |
- A provider API solves the access problem: how to send a request to a specific model endpoint.
- A framework interface solves the composition problem: how to connect models, tools, retrievers, chains, or agents within a framework ecosystem.
- A project wrapper solves the maintainability problem: how your application expects to interact with a model regardless of which provider or framework sits underneath.
This distinction matters because the three layers change at different speeds. Provider APIs evolve. Frameworks change abstractions. But your application still needs a stable internal contract.
3. Mini Project Example: ICD Code Extraction
The rest of this post builds that minimal utility layer from scratch, using a small extraction task: extracting ICD code records from OCR-style medical page text.
The input is page-level text grouped by file name and page number:
The desired output is structured JSON:
This task exercises the core mechanics of most LLM pipelines:
- prompt construction;
- provider-independent model execution;
- JSON parsing;
- schema validation;
- retry behavior;
- batch processing over files and pages;
- lineage metadata such as
file_nameandpage_num;
- evaluation against expected outputs.
ICD extraction simply gives us a concrete running example. The same project structure applies to invoice extraction, contract clause extraction, support ticket classification, document summarization, and RAG answer generation.
4. The Project Architecture
The pipeline has one simple flow:
I organize the project into a small set of modules:
The important part is the separation of responsibilities.
5. Data Boundaries: data/utils.py
Data I/O should be boring. Loading examples, saving outputs, and writing intermediate artifacts should not be mixed with model calls or prompt logic.
This layer only moves data in and out of the filesystem. Keeping this boundary clean matters because LLM pipelines often generate many artifacts:raw inputs, intermediate parsed outputs, failed examples, evaluation reports, debug traces, final extracted records. If I/O is scattered across the pipeline, debugging and reproducibility become much harder.
6. Prompt Construction: llm/prompts.py
Prompt construction should be separate from execution.
A prompt is not just a string. It is a structured input contract containing instruction, schema expectation, task context, and user payload. If the prompt is hardcoded inside the request function, the execution layer becomes domain-specific and hard to reuse.
For this example, we can split prompt construction into three small functions:
build_system_prompt(): defines role, task, output format, and rules
build_user_content(): wraps the real input data
build_messages(): produces provider-compatible chat messages
Now the the task becomes configuration:
This separation gives us a clean boundary:
- prompts.py — how to construct the model input
- client.py — how to execute the model call
The execution layer should not know what an ICD code is. The prompt layer should not know how the provider SDK works.
7. Model Call Boundary: llm/client.py
The lowest-level LLM wrapper should be generic. I prefer the name
call_llm() for this layer because it describes exactly what it does: call a language model and return the raw text.The purpose of the
call_llm() wrapper is to provide a consistent internal interface for model calls, regardless of the underlying provider. Under the hood, this wrapper might call client.converse() for Amazon Bedrock, client.messages.create() for Anthropic, client.responses.create() or client.chat.completions.create() for OpenAI, or model.invoke() inside a LangChain-based codebase.The business logic should not need to know which provider-specific API is being used. Whether the task is extraction, summarization, classification, or conversational generation, the caller interacts with the same project-owned function, while provider-specific request shapes stay encapsulated inside the wrapper. The model name is read from an environment variable rather than hardcoded throughout the codebase. This keeps model selection configurable and prevents model-specific assumptions from leaking into application logic.
call_llm() is simple:- raw model-call wrapper
- returns raw text
- no parsing
- no schema validation
- no task-specific logic
Do not make
call_llm() return a JSON dictionary by default. That would make it unusable for summarization, rewriting, question answering, or other free-form generation tasks.Instead, keep it generic:
Then parse and validate at the next layer.
8. Parsing: llm/parsing.py
LLMs output text. Pipelines need structured data. Even when we ask for JSON, the runtime still receives a string. The parser is responsible for crossing the first boundary: turning raw model text into a Python dictionary.
Parsing only proves that the output is syntactically valid JSON and has a top-level shape the pipeline can work with.
9. Validation: Schema as a Data Contract
Valid JSON is not the same as usable data. A model can return syntactically valid JSON that still violates the expected schema:
That is why parsing and validation should be separate steps.
A parser handles syntax. A schema validator handles shape. A domain validator can handle deeper semantic constraints, such as valid ICD format, normalized dates, or evidence spans.
For structured extraction, do not rely only on the prompt instruction “return JSON only.” That is a soft constraint. Before writing the Pydantic model, define the response envelope. Instead of asking the model to return a bare list:
prefer a top-level object:
The top-level object represents the response envelope, while
records represents the extracted entities. This gives the output contract room to grow. Later, the same schema can support metadata, warnings, confidence scores, source spans, or page-level diagnostics without breaking downstream consumers:This structure also makes zero-result extraction explicit. For extraction tasks, the correct answer is often “nothing found.” The model should not be forced to invent an empty record or return an ambiguous
null. An empty list is a valid and clean output:Pydantic is a common choice for enforcing this contract because it turns raw dictionaries into typed objects and produces clear validation errors.
The value of the schema layer is that it prevents invalid model outputs from being silently accepted by your program. If the model returns the wrong shape, misses required fields, or uses the wrong type, validation fails. Once validation fails, the pipeline has an explicit decision point:
- retry
- repair
- fallback
- log for review
- send to manual QA
That is the difference between probabilistic generation and production software. The model can still be wrong, but the system should not quietly treat malformed output as valid data.
This creates three distinct layers:
call_llm():returns raw text
parse_json_output():checks JSON syntax and top-level object shape
validate_schema():checks whether JSON matches the expected Pydantic model
10. Robust Execution: llm/runners.py
The runner layer turns individual model calls into a reliable data processing operator.
There are two responsibilities here:
- retry model calls when failures are plausibly recoverable;
- apply the pipeline repeatedly over pages, documents, or chunks.
This implementation retries network failures, malformed JSON outputs, and schema mismatches. In practice, the retry layer should distinguish between recoverable and non-recoverable failures.
Recoverable failures include:
- timeout
- connection reset
- rate limit
- malformed or truncated JSON output
- temporary provider error
Non-recoverable failures include:
- invalid credentials
- unsupported parameters
- context length overflow
- invalid model name
- schema design errors
In a production implementation, provider-specific exceptions should be mapped into typed errors before reaching the retry layer. This minimal version keeps the code readable while showing the intended control flow.
The batch runner is where the LLM call becomes a data processing operator. It also attaches lineage metadata, such as
file_name and page_num, to every extracted record. That metadata is not decorative. It is what lets the user audit where an extraction came from.11. Evaluation: llm/evals.py
Prompt engineering is subjective until it is measured.
For extraction tasks, a simple starting point is record-level exact match. We canonicalize dictionaries into stable JSON strings so that list order does not affect scoring.
This scorer is intentionally strict. It works well for small canonical extraction payloads, but it is not the only evaluation strategy.
For real extraction systems, field-level precision and recall are often more useful than full-record exact match. For summarization or open-ended generation, exact match is usually the wrong metric entirely. In those cases, evaluation may require rubric-based scoring, LLM-as-judge, human review, or task-specific validators.
The point is not that exact match solves evaluation. The point is that evaluation must exist as a first-class module.
Without an eval loop, prompt engineering is guessing.
12. Integration: main.py
Finally, the pipeline becomes an operator that consumes raw data and produces verifiable output.
At this point, the application has a clean flow:
Each component has one job. Each component can be tested. Each component can be replaced.
That is the difference between a notebook demo and a maintainable LLM pipeline.
13. Useful Libraries for LLM Pipelines
A minimal pipeline can be built with the Python standard library and one provider SDK. As the project grows, these libraries become useful:
Category | Common Libraries | Purpose |
LLM API | openai | Calling OpenAI-compatible model endpoints |
Environment Config | os, python-dotenv | API keys, model names, base URLs |
JSON Parsing | json | Parsing model outputs |
Schema Validation | pydantic | Validating structured outputs |
Retry | tenacity | Backoff, retry policies, transient failure handling |
Progress Bar | tqdm | Tracking batch jobs |
Prompt Templates | jinja2 | Rendering complex prompt templates |
Data Handling | pandas | Working with tabular input and outputs |
Token Counting | tiktoken or approximate counters | Managing context length |
Logging | logging, jsonlines | Structured logs and result persistence |
Parallelism | concurrent.futures, asyncio | Processing pages or documents concurrently |
Do not start by installing every framework. Start by making the pipeline boundaries explicit. Then add libraries where they remove real friction.
Takeaway
A minimal LLM pipeline is not a framework. It is a boundary layer.
It separates data I/O, prompt construction, model execution, parsing, validation, retries, batch processing, and evaluation into small software components. Once these responsibilities are separated, switching providers, testing prompts, debugging malformed outputs, and scaling to batch workloads become ordinary software engineering problems rather than notebook hacks.
Frameworks solve ecosystem integration. Foundational wrappers solve long-term maintainability.
- Author:Fan Luo
- URL:https://fanluo.me/article/building-a-minimal-llm-pipeline-from-scratch
- Copyright:All articles in this blog adopt BY-NC-SA agreement. Please indicate the source!
上一篇
[Leetcode 1813] 句子相似性 III
下一篇
Production Agents Need Workflow Graphs
