The ML Factory: Building Production ML Systems

Jun 29, 2025 · 14 min read

ML system ≠ Model

It’s tempting to treat “the model” as the whole story: logistic regression, a ResNet, an LLM, a diffusion model. You run your experiments, get a good ROC curve or BLEU score, and it feels like the problem is solved. However, the model is just one component of the system. A production system is everything around the model:
  • Managing messy data and labels across data lakes, streams, and warehouses
  • Serving predictions to millions of users with tight SLAs
  • Deciding what the model should do vs what deterministic code should do
  • Constant monitoring and operational work to keep it all from quietly decaying
  • Navigating privacy, compliance, and safety constraints
When we talk about ML system design, we’re really talking about:
Turning a vague product idea into a robust, observable, privacy-compliant system that just happens to use AI/ML models.
 

Start With the Problem

To build a machine learning system — whether it’s detecting harmful videos, generating images, or summarizing email — the first step is to define the problem clearly. This means understanding two things: what the system should do (its core function) and how it should behave (its performance, reliability, and ethical boundaries).
  • On the “do” side, we clarify functional requirements: what is the exact input and exact output?
    • For a harmful-video detector, that might mean: given a video plus some metadata, output a risk score and a label that can drive decisions in ranking, recommendation, and enforcement (an illustrative input/output contract is sketched after this list).
  • On the “behave” side, we define non-functional requirements: latency, throughput, scale, fairness, safety, availability, cost, and legal constraints.
    • A moderation system may have very different speed requirements depending on where it’s used. It might be allowed a few seconds per video in an offline pipeline, but only a few hundred milliseconds in an online search ranking system. Those numbers will constrain which models, hardware, and serving patterns are even viable.
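To make the “do” side concrete, here is a minimal sketch of what such an input/output contract might look like for the hypothetical harmful-video detector. The field names, labels, and threshold are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class ModerationRequest:
    """Exact input: a video reference plus the metadata the model may use."""
    video_id: str
    title: str
    description: str
    extra_metadata: Dict[str, str] = field(default_factory=dict)

@dataclass
class ModerationResult:
    """Exact output: a score and a label that downstream systems can act on."""
    video_id: str
    risk_score: float      # calibrated probability in [0, 1]
    label: str             # e.g. "safe", "borderline", "harmful" (hypothetical labels)
    model_version: str     # recorded for auditing and rollback

def decide_action(result: ModerationResult, block_threshold: float = 0.9) -> str:
    """Illustrative policy layer: ranking and enforcement consume the score."""
    if result.risk_score >= block_threshold:
        return "remove_and_review"
    if result.label == "borderline":
        return "downrank"
    return "serve_normally"
```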
We also want to be explicit about the business goal. “Achieve 97% precision” is useless if the real target is “reduce user complaints about harmful content by 50% while keeping watch time flat.” That gap between ML metrics and product outcomes is where many “successful” models quietly fail.
Only after understanding the problem do we ask: “What exactly is the ML task here?”
 

Turning Requirements Into ML Tasks

ML solves well-posed prediction or generation tasks.
Once we understand the product requirements, we need to translate them into:
  1. Well-defined inputs and outputs
  2. A concrete ML formulation (classification, ranking, generation, etc.)
For a chatbot, the input might be user text (plus conversation history), and the output is a text response. For a Street View blur system, the input might be an image or image tiles, and the output is bounding boxes or segmentation masks for everything that should be blurred.
At this stage you’re also choosing between broad families of methods. Is this fundamentally discriminative (predict labels, scores, or ranks), or generative (produce new text, images, or videos)? Some systems combine both. For example, we might use a discriminative model to decide which utterance triggers a generation task, then a generative model to write the response or summarize information.
The goal isn’t to pick a specific architecture yet. It’s to frame the ML problem cleanly so that:
  • we know what data to use
  • we can define meaningful metrics
  • we have a clear contract between the model and the rest of the system
 

Deciding What the Model Should Do (and What It Shouldn’t)

A crucial design decision that often gets skipped is: where exactly does “intelligence” live in the system?
In practice, production systems mix ML and traditional deterministic logic:
  • The model is good at things like natural language understanding, generation, prediction and ranking.
  • Traditional deterministic code is better at enforcing rules: “this user is under 18,” “this field must be non-empty,” “this request exceeds rate limits,” “never show content from this banned list,” “always log these events.”
For example, a Street View blur system might rely on models to detect faces and license plates, while deterministic code enforces aggressive blurring of all detected faces prior to storage or display. Recommendation systems often rely on machine learning to predict metrics like click-through rates or watch time, but strict business constraints—such as legal requirements, content policies, or contractual obligations—are then enforced with rules.
To architect a robust system, we should define the boundaries of authority explicitly: where the model has freedom to make decisions, and where the human architect enforces control through code and rules (a small sketch of this split follows the list). Key questions to guide this process include:
  • Which decisions do we delegate to the model?
  • Which aspects remain under explicit human-coded rules and constraints?
  • How should the model’s outputs and the rule-based components interact and communicate?
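As a minimal sketch of that split, the model is free to produce a ranking score, while deterministic code has the final say on hard constraints. The User/Item types, the blocklist, and the age rule below are hypothetical examples, not a real policy.

```python
from dataclasses import dataclass

@dataclass
class User:
    age: int

@dataclass
class Item:
    creator_id: str
    age_restricted: bool

BANNED_CREATORS = {"creator_123"}          # hypothetical hard blocklist

def final_score(user: User, item: Item, model_score: float) -> float:
    """Deterministic policy layer: rules the model is never allowed to override."""
    if item.creator_id in BANNED_CREATORS:
        return float("-inf")               # never show content from the banned list
    if user.age < 18 and item.age_restricted:
        return float("-inf")               # legal constraint, not a model decision
    return model_score                     # otherwise, defer to the model's prediction
```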
Alongside this, building a real system is a team sport. Product managers, ML engineers, data engineers, DevOps/MLOps, and security/compliance all see different parts of the elephant. A “successful model” that ignores any one of those perspectives tends to fail in production.
 

Data Pipelines

When applying a “perfect” research model to the business world, the biggest challenge isn’t the architecture—it’s the data.
Data for real systems is spread across event logs, relational databases, object storage, third-party APIs, and whatever else the organization has accumulated. Some of it is labeled, much of it isn’t. Some of it is full of PII and legal landmines. Almost all of it is messy.
It is necessary to design pipelines that answer a few basic questions:
  • What data do we need to train the model?
  • What data do we need at inference time?
  • Where does that data live, and how fresh must it be?
  • How do we clean it, anonymize it, and avoid leakage?
We will usually end up with at least two flows:
  • A batch flow for training and offline evaluation, feeding off data in a data lake or warehouse.
  • A streaming or near-real-time flow for updating features, embeddings, or indices used by online serving.
Under the hood, we make decisions about ETL jobs, stream processors, schema management, feature stores, and storage formats. But conceptually, what matters is that we treat data as a first-class system component.
Privacy and compliance also belong here from day one. We have to know which data is used only for inference, which is used for training, how consent is handled, and how we would respond if regulators or users ask what we did with it.
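A toy sketch of those two flows, under the assumption that raw events live as Parquet files in a data lake and that online serving reads features from a small key-value store; paths, column names, and the feature logic are all illustrative.

```python
import pandas as pd

def nightly_training_job(lake_path: str, warehouse_path: str) -> None:
    """Batch flow: rebuild the offline training table from raw events in the lake."""
    events = pd.read_parquet(lake_path)
    events = events.dropna(subset=["video_id", "label"])     # keep only labeled rows
    events = events.drop_duplicates(subset=["event_id"])     # dedupe replayed events
    events.to_parquet(warehouse_path)                        # table used for training/eval

class FeatureStore:
    """Streaming flow (toy): keep the freshest features available to online serving."""
    def __init__(self) -> None:
        self._features: dict = {}

    def on_new_event(self, event: dict) -> None:
        # Apply the same feature logic the batch flow uses, one event at a time.
        self._features[event["user_id"]] = {
            "videos_watched_today": event["videos_watched_today"],
            "last_category": event["category"],
        }

    def get(self, user_id: str) -> dict:
        return self._features.get(user_id, {})
```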
 

Building the Model in Context

Once we have a clear problem definition and a solid plan for data, we can finally talk about models.
We still start with baselines – simple models or even heuristic rules – because they give us a sanity check and a fast feedback loop. For many problems, a carefully engineered logistic regression, gradient boosting model, or simple dual-encoder beats a fancy architecture that’s poorly integrated.
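As an illustration, a baseline for the harmful-video example could be as simple as TF-IDF features over titles and descriptions feeding a logistic regression; `load_moderation_dataset` is a hypothetical loader, and the hyperparameters are placeholders.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

texts, labels = load_moderation_dataset()   # hypothetical loader of text + binary labels

baseline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
baseline.fit(X_train, y_train)
print(classification_report(y_test, baseline.predict(X_test)))
```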
From there, we consider more complex options: deep networks, transformers, multimodal models, pre-trained LLMs or vision models, fine-tunes, adapters, retrieval-augmented setups, and so on.
Every choice has to be justified relative to:
  • Latency budgets and hardware constraints
  • Data availability and label quality
  • Interpretability needs
  • Operational cost and scalability
Training details (optimizers, distributed training strategies, mixed precision, gradient checkpointing, etc.) absolutely matter, but at the system level we care most about:
  • Whether the objectives and loss functions reflect what the product cares about, not just what is easy to optimize
  • Whether the training pipeline can be run repeatedly and reliably
  • Whether we can scale training as data and models grow
The model is not a one-off artifact. It’s part of a repeatable process.
 

Embeddings, Vector Stores, and the Versioning Trap

In modern LLM and RAG systems, embeddings and vector stores have become core infrastructure.
Two key principles to keep in mind:
  1. Effective retrieval requires that the embedding model used for querying a vector index be the same as the one used to generate the stored vectors. Without this alignment, similarity scores become meaningless. Switching models isn’t a trivial config change—it typically necessitates re-embedding the dataset and reconstructing the index.
  2. Embeddings should be treated as versioned artifacts, since changes to the embedding model or its parameters can alter the semantic space and invalidate prior indices. For each index, we need to know:
      • Which embedding model and version produced its vectors,
      • What preprocessing was done on the text or other inputs,
      • What the vector dimension and index structure are.
In production RAG pipelines, a crucial three-way dependency must not be overlooked: the LLM version, the prompt used for querying, and the embedding model along with its index. Changing any one of these elements often necessitates revisiting the others. A robust system treats this whole bundle as a versioned unit, with the ability to roll forward or back as one.
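One way to make that bundle explicit is to keep it in a single versioned record and refuse to query an index whose metadata doesn’t match. The structure below is an illustrative sketch, not any specific tool’s format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RagBundle:
    """Everything that must move together when any one piece changes."""
    llm_version: str             # e.g. "chat-model-2025-06" (hypothetical id)
    prompt_version: str          # id of the query/answer prompt template
    embedding_model: str         # model that produced the stored vectors
    embedding_dim: int
    index_name: str              # vector index built from those embeddings
    preprocessing_version: str   # chunking / normalization applied before embedding

def check_compatibility(bundle: RagBundle, index_metadata: dict) -> None:
    """Fail fast instead of silently comparing vectors from different semantic spaces."""
    if index_metadata["embedding_model"] != bundle.embedding_model:
        raise ValueError(
            f"Index {bundle.index_name} was built with "
            f"{index_metadata['embedding_model']}, not {bundle.embedding_model}; "
            "re-embed the corpus and rebuild the index before serving."
        )
```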
 

LLMOps / MLOps: Life After “It’s Deployed”

Deploying a model is not the finish line. It’s the start of a long, unglamorous stretch of work we call MLOps or LLMOps.
One important piece is version control for models, prompts, and configurations. In a non-trivial system, multiple versions live at once: maybe an old model serving 90% of traffic, a new one serving 10%, a shadow deployment observing requests but not affecting users, and a few internal variants for specific teams.
To let a whole team roll out and roll back versions without guessing which feature uses what, we need (see the routing sketch after this list):
  • A registry for models, prompts, and indices
  • Clear metadata: where each version is used, what metrics it achieved, when it was deployed
  • Simple mechanisms to route traffic, run A/B tests, and revert changes safely
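As a sketch of the traffic-routing piece, a deterministic hash split lets the same request always hit the same version, which makes A/B comparisons and rollbacks predictable. The version names and shares are hypothetical.

```python
import hashlib

# Hypothetical routing table: model version -> share of traffic.
ROUTING = {"ranker-v12": 0.90, "ranker-v13": 0.10}

def pick_version(request_id: str) -> str:
    """Hash-based split: stable per request, easy to shift by editing ROUTING."""
    bucket = (int(hashlib.md5(request_id.encode()).hexdigest(), 16) % 10_000) / 10_000
    cumulative = 0.0
    for version, share in ROUTING.items():
        cumulative += share
        if bucket < cumulative:
            return version
    return next(iter(ROUTING))   # fallback if shares don't sum to exactly 1.0
```

Rolling back then means editing the routing table (and its registry metadata), not redeploying code.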
Equally important is monitoring. While standard system metrics such as latency, error rate, and throughput remain essential, ML-specific and product-level signals must also be tracked.
For ML metrics, track accuracy, recall, calibration, or hallucination rates using held-out slices drawn from production data. Equally important are product metrics—real-world behavior such as active users, conversation length, containment rate, click-throughs, conversions, and complaint rates. These reveal whether the system truly helps users and drives business value, regardless of how good the offline validation curves appear.
If active users drop after a new model launch—even if offline metrics improve—you haven’t delivered a better system. You’ve earned a research win but taken a product loss.
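A minimal sketch of that kind of check: score a freshly labeled slice of production traffic on a schedule and alert when an ML metric falls below a floor agreed with the product side. The floors and the `alert` callback are assumptions for illustration.

```python
from sklearn.metrics import precision_score, recall_score

RECALL_FLOOR = 0.85       # agreed with product/policy, not set by the model team alone
PRECISION_FLOOR = 0.90

def check_production_slice(y_true, y_pred, alert) -> None:
    """Compare live performance on a labeled production slice against the agreed floors."""
    recall = recall_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    if recall < RECALL_FLOOR:
        alert(f"Recall dropped to {recall:.3f} on the production slice")
    if precision < PRECISION_FLOOR:
        alert(f"Precision dropped to {precision:.3f} on the production slice")
```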

Privacy, Safety, and Compliance as Design Constraints

Any system that touches user data or generates content at scale will eventually collide with privacy, safety, and legal constraints. It’s much cheaper to acknowledge this early than to patch it later.
On the privacy side, the big questions are:
  • What user data do we even need?
  • Are we using it only for inference, or also for training and evaluation?
  • Do we need explicit consent to use certain kinds of data in training?
  • How do we anonymize, pseudonymize, or aggregate data to reduce risk?
For many contexts, the safe answer is to default to no training on raw user content unless users have clearly opted in, and even then to strip identifiers early in the ingestion process. We also need clear policies for access control, audit logs, and data retention.
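A small sketch of that default, assuming a record dict with the fields shown: non-consented content never reaches the training store, and identifiers are replaced by a salted one-way hash at ingestion.

```python
import hashlib
from typing import Optional

SALT = "rotate-me-regularly"   # in practice, pulled from a secrets manager

def ingest_for_training(record: dict) -> Optional[dict]:
    """Drop non-consented data and strip identifiers before anything is stored for training."""
    if not record.get("training_opt_in", False):
        return None                                   # default: do not train on this content
    user_key = hashlib.sha256((SALT + record["user_id"]).encode()).hexdigest()
    return {
        "user_key": user_key,                         # one-way pseudonym, not the raw id
        "text": record["text"],
        # email, IP address, and other direct identifiers are intentionally not copied
    }
```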
On the safety and fairness side, models inherit the biases and toxic content of the data they were trained on. We can’t just shrug and say “that’s how the internet is.” For systems that generate or filter content – chatbots, recommendation engines, search assistants, creative tools – we need layers of moderation and safety checks around the model, plus ongoing red-teaming and bias analysis.
If the system operates in sensitive domains like health or finance, expectations are even higher: users are not just annoyed if the model is wrong or offensive; they may be harmed.
 

Evaluation: Offline & Online

Evaluating ML models in this context is a two-stage process.
  1. Offline evaluation. This is where we use held-out datasets and labels to compute metrics: accuracy, precision, recall, F1 scores, ROC curves, nDCG for ranking, and task-specific scores for generation (BLEU/ROUGE-like metrics for text, FID-like metrics for images, and so on).
     These metrics are necessary. They enable rapid iteration, help detect obvious regressions, and allow comparison of candidate models before risking deployment.
     But offline metrics are not sufficient. As soon as the system interacts with real users and other systems, there are effects we can’t see in an evaluation set:
      • Users might change their behavior in response to new recommendations.
      • Workers might adapt how they label or review content.
      • Spammers and adversaries might adapt to exploit new weaknesses.
  2. Online evaluation. A/B tests, canary rollouts, shadow deployments, and continuous monitoring of product metrics.
     In a ranking or recommendation system, key metrics include click-through rate, watch time, satisfaction scores, and churn. In a moderation system, relevant metrics might include complaint rates, appeal rates, or the prevalence of harmful content.
The right way to think about this is that offline metrics filter the options, but online metrics decide the winners.
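Put differently, a rough sketch of the gate: offline metrics decide whether a candidate is even allowed into an experiment, and the experiment’s product metrics decide whether it ships. The specific metrics and tolerances below are illustrative.

```python
def passes_offline_gate(candidate: dict, baseline: dict) -> bool:
    """Offline filter: the candidate must not regress on the held-out metrics we care about."""
    return (candidate["recall"] >= baseline["recall"]
            and candidate["precision"] >= baseline["precision"] - 0.01)

def pick_winner(control: dict, treatment: dict) -> str:
    """Online decision: product metrics from the A/B test choose what actually ships."""
    if treatment["complaint_rate"] > control["complaint_rate"]:
        return "control"                 # safety/product regression overrides ML gains
    if treatment["watch_time"] >= control["watch_time"]:
        return "treatment"
    return "control"
```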
 

Final Thoughts

Building production ML systems is far more than selecting a model. Success requires thinking in terms of a full lifecycle: defining precise functional and non-functional requirements, designing robust data pipelines, splitting logic between models and rules, versioning and deploying models, prompts, and embeddings as coherent units, and continuously monitoring system performance and product impact.
From the outside, an ML system looks like magic—data goes in, intelligence comes out. From the inside, it’s a complex factory, where every component—data, models, infrastructure, policies, and UX—must work together to deliver real-world value to users and the business.
 