Recommendation systems (RecSys) are the backbone of personalized digital experiences. Whether you're browsing products, watching videos, or reading articles, these systems silently optimize for relevance, diversity, and engagement by integrating machine learning, deep learning, and large models.
Since emerging as a distinct research field in the mid-1990s, RecSys have gained prominence across major platforms such as Amazon, Netflix, and Spotify. They have also become a central topic in academic conferences, university curricula, and scholarly publications. The development of recommender systems continues to be driven by both practical applications and theoretical advancements, making them one of the most dynamic areas in information systems and machine learning.
Over time, recommender systems have evolved into a powerful solution to the problem of information overload, particularly in the context of e-commerce and digital media.
From Simple Suggestions to Smart Predictions

While non-personalized recommendations—such as “Top 10 Editor’s Picks” for books or movies—can be useful in certain contexts, they typically underperform personalized feeds at engaging individual users or influencing their behavior.
The foundational idea behind RecSys stems from everyday human reliance on peer or community recommendations. Early systems mimicked this behavior through collaborative filtering: suggesting items preferred by similar users. Traditionally, RecSys research focused on ranking items based on explicit signals—such as user ratings—or on implicit signals inferred from user behavior, like clicks or purchases.
As online platforms grew, the need to navigate vast catalogs of products and services became increasingly urgent. Recommender systems evolved to employ more sophisticated algorithms that guide users toward relevant choices. This evolution was driven not only by convenience but also by the recognition that too many options can cause decision fatigue and reduce satisfaction. Modern RecSys address this challenge by leveraging diverse types of knowledge and data about users, items, and past interactions. By incorporating contextual information, feedback, and iterative learning, these systems surface items that align with a user’s current context or task and continually refine future suggestions. Many research and production frameworks now support rich, context-aware recommendation capabilities. Recently, large language models (LLMs) have started to augment traditional recommender architectures—for example, in candidate generation, re-ranking, explanation generation, and constraint handling—though they still typically sit on top of the classic multi-stage retrieval-and-ranking funnel, as discussed in my follow-up post.
Building Reliable Recommendation Systems at Scale
Modern recommendation infrastructure transforms machine learning models into reliable, production-grade systems.
It spans the entire stack — from hardware and data pipelines to feature stores, training platforms, serving systems, monitoring, and MLOps automation — providing everything needed to build, deploy, and operate recommendation models at scale.
The full lifecycle can be summarized as:
Data → Features → Modeling → Validation → Registry → Deployment → Inference → Monitoring
Each stage in this lifecycle depends on robust infrastructure to ensure reliability, scalability, and automation — allowing teams to continuously iterate and improve recommendation quality in production.
Data Pipeline: Collection, Processing and Storage
At the heart of every recommender system lies data — the digital traces that capture what users do, what they like, and how items relate to one another. Without this raw log data, the system cannot learn user preferences. The quality and diversity of this data largely determine how accurate and useful recommendations can be.
The pipeline should be designed to effectively capture, process, and stream large volumes of events in real time. Since interactions are generated continuously, its main challenges are scalability and latency.
Data Collection (Extraction)
The data pipeline is primarily concerned with the flow of raw user-item interaction events, which are generally categorized as:
- Explicit Feedback: Data where a user provides a direct judgment (e.g., product ratings, reviews, or “like”/“dislike” clicks).
- Implicit Feedback: Data inferred from user behavior, which is abundant but noisy (e.g., clicks, impressions, views, purchases, searches, and other session activity).
An event schema is generally logged for any user interaction with content. Typical fields include user_id, item_id, timestamp, session_id, and request context (device, location, referrer, etc.). Specific event types carry additional data: view events, for example, also log how long the user consumed the content (e.g., a dwell_time_ms field), while some events may omit an item_id entirely (such as search query events that only log the query text and context).
Event sources — including web frontends, mobile applications, and backend services — act as producers, publishing event data. These events are streamed into Apache Kafka, which serves as a unified, high-throughput data pipeline that aggregates event streams into Kafka topics, enabling downstream consumers — such as stream and batch processing engines — to access and process the data in real time or in batch mode.
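For illustration, two events published to such a topic might look like the following minimal sketch; field names beyond those mentioned above are hypothetical:

```python
# A minimal, illustrative interaction event (field names beyond those
# mentioned in the text are hypothetical).
click_event = {
    "event_type": "click",
    "user_id": "u_19283",
    "item_id": "v_55021",
    "timestamp": "2024-06-01T12:34:56.789Z",
    "session_id": "s_77aa01",
    "context": {"device": "ios", "location": "US", "referrer": "home_feed"},
}

view_event = {
    "event_type": "view",
    "user_id": "u_19283",
    "item_id": "v_55021",
    "timestamp": "2024-06-01T12:35:10.120Z",
    "session_id": "s_77aa01",
    "dwell_time_ms": 14200,  # view events additionally log consumption time
    "context": {"device": "ios", "location": "US", "referrer": "home_feed"},
}
```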
Data Processing (Transformation)
This step cleans, deduplicates, aggregates, enriches, and transforms the raw data into standardized schemas to ensure overall data health and utility.
Depending on latency and use-case requirements, the processes are implemented through streaming or batch pipelines.
- Streaming processing (Flink/Spark Streaming): handles events continuously in real time, enables up-to-the-moment context awareness but introduces complexity and operational cost.
- Batch processing (Glue / Spark): operates on large historical datasets at scheduled intervals, offers high stability and accuracy but produces stale features and delayed predictions.
These processes are increasingly supported by streamhouse architectures that unify lakehouse storage with real-time streaming, providing a single, consistent data plane for both historical analytics and real-time features, while enabling batch retraining for ML models.
Data Storage (Load)
While the streaming processors consume directly from Kafka, a Batch Sink—such as Amazon S3, Google Cloud Storage, or HDFS—continuously exports and archives the same event streams for long-term use.
From there, batch ETL frameworks (e.g., AWS Glue, Apache Spark) transform the stored data and load it into a data lake or data warehouse that holds both raw and processed data.
Together, these storage layers serve as long-term repositories for raw, processed, and historical data—powering large-scale batch feature computation, deep analytics, and periodic model retraining.
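As a rough sketch of the batch ETL step (assuming PySpark; the bucket paths and columns are hypothetical), a daily job might clean and deduplicate the archived events before writing them to the lake:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-event-etl").getOrCreate()

# Read one day of raw events archived by the batch sink (path is hypothetical).
raw = spark.read.json("s3://event-archive/raw/dt=2024-06-01/")

clean = (
    raw.filter(F.col("user_id").isNotNull())             # drop malformed events
       .dropDuplicates(["user_id", "item_id", "timestamp", "event_type"])
       .withColumn("event_date", F.to_date("timestamp"))  # derive partition column
)

# Write processed events to the lake, partitioned for downstream feature jobs.
clean.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://event-archive/processed/events/"
)
```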
Feature Engineering: Creating Predictive Signals
Feature engineering is the art and science of transforming raw data and derived statistics into meaningful, quantitative features that machine learning models can use for prediction. Nowadays, many large recommender systems rely heavily on representation learning (embeddings), end-to-end deep networks, and online learning pipelines rather than purely hand-engineered features.
Feature Types
Features can be broadly categorized into four groups: user features, item features, context features, and user-item interaction features:

- User-side features (or request-level features) store demographic and behavioral data about users.
- Demographic data: attributes such as age, gender, location, and other profile details. However, the use of demographic information is often restricted due to machine-learning fairness concerns.
- User embedding (profiling): a dense vector representation that captures a user’s overall preferences and behavioral patterns.
- User behavior features: characteristics based on the user's past actions, aggregated over specific time windows (e.g., 7D, 14D). In modern recommender systems, sequential behavior modeling is now widely used.
- Item-side features describe the attributes, content, and relational context of each item (e.g., a product, video, post, or pin). They typically fall into the following categories:
- Static Attributes (Metadata Features): basic, relatively unchanging descriptive information of the item, such as item ID, brand, author, seller, publisher, category, tags, age of content, price, size, availability status.
- Content Features (Multimodal Inputs): features derived directly from the content itself. They include textual embeddings (titles, descriptions, tags), visual embeddings (images), audio embeddings (songs, podcasts), and video embeddings (frame- or clip-level features), typically extracted using models like BERT, CLIP, CNNs, Wav2Vec or ViT.
- Item embedding: a dense vector representation of items.
- Semantic IDs: compact, content-derived representations that capture semantic meaning and can replace arbitrary sparse item IDs.
- Item behavioral features: aggregated signals derived from user–item interactions that reflect an item’s popularity, quality, and temporal dynamics — for example, aggregated interaction counts, click-through rate (CTR), conversion rate (CVR), and dwell-time statistics.
- User-item interaction features: signals specific to a particular user-item pair, providing the finest-grained signal beyond what user-only or item-only attributes can explain. They capture how likely this user is to engage with this item. Examples:
- user_item_click_count: number of times the user clicked this item
- user_item_view_dwell_time: average dwell time on this item
- user_item_last_interaction_time: time since the user last interacted with this item
- Context features capture the environment or situation surrounding the recommendation event, such as time, location, device, app version, and page position.
Real-World Feature Usage (examples)
| Platform | Key Ranking Inputs |
| --- | --- |
| YouTube | User & video embeddings + watch history + demographics + device + session context |
| Pinterest (PinSage) | Graph-based pin embeddings + visual/text features + user profile + session behaviors |
| Amazon | Product attributes + price + user demographics + purchase history + seasonality/context |
Feature Processing
Feature processing transforms raw data — user interactions, item attributes, and contextual signals — into model-ready representations. Beyond traditional data engineering, it adds ML semantics: computing, versioning, and serving the exact signals that models need.
Because recommendation data is inherently heterogeneous (IDs, counts, text, images, etc.), transformations differ by data type: sparse, numeric, categorical, and embedding-based features.
- Sparse ID features: most large-scale recommender models begin with sparse identifiers (e.g., user_id, item_id, post_id, ad_id), which need to be transformed into dense representations.
  - Embedding lookup tables: convert discrete IDs into dense vectors.
  - Hash & vocabulary tuning: compress immense ID spaces (millions–billions) into fixed-size embedding tables; tune the hash size to balance memory footprint against collision rate, and monitor collision frequency and tail-ID coverage.
  - Weighted ID lists: for features like “recently viewed items,” weight IDs by recency or frequency before pooling their embeddings (mean/sum/attention pooling).
- Numeric features: signals such as click_count, watch_time, and price must be scaled and regularized to stabilize model training.
  - Normalization: align different scales (e.g., min-max, z-score) to prevent small-range features from vanishing alongside large-range ones.
  - De-skewing transforms: apply log(x+1) or Box-Cox for heavy-tailed distributions.
  - Clipping: truncate extreme outliers to maintain robustness.
- Categorical features: attributes such as country, device_type, or membership_level are represented as discrete tokens.
  - One-hot / multi-hot encoding: for low-cardinality categories.
  - Embeddings for high-cardinality categories: learned in the model or precomputed offline.
  - Hierarchical grouping: merge rare categories into “other” to reduce sparsity.
- Embedding-based features
Embedding-based features represent semantic information about users, items, or content as dense numerical vectors. Unlike traditional engineered features, these are typically learned by deep models trained on large-scale behavioral or multimodal data. To reduce memory and latency costs, high-dimensional embeddings are often projected into compact spaces (e.g., 64–128D) before serving, sometimes using PCA or learned projection layers.
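To make the sparse-ID path above concrete, here is a minimal PyTorch sketch of hashed embedding lookup with weighted pooling; the table size, hash function, and weights are illustrative assumptions, not production choices:

```python
import hashlib
import torch
import torch.nn as nn

NUM_BUCKETS = 1_000_000  # fixed-size table; tune against collision rate
EMBED_DIM = 64

embedding_table = nn.Embedding(NUM_BUCKETS, EMBED_DIM)

def hash_bucket(raw_id: str) -> int:
    # Illustrative stable hash; production systems typically use seeded
    # murmur/farm hashes so buckets stay consistent across runs.
    digest = hashlib.md5(raw_id.encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

# "Recently viewed items" as a weighted ID list: pool the item embeddings,
# weighting each one by recency.
recent_items = ["item_42", "item_7", "item_99"]
weights = torch.tensor([0.5, 0.3, 0.2])  # e.g., recency-decayed weights

ids = torch.tensor([hash_bucket(i) for i in recent_items])
pooled = (embedding_table(ids) * weights.unsqueeze(-1)).sum(dim=0)  # weighted-sum pooling
print(pooled.shape)  # torch.Size([64])
```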
Streaming vs. Batch Feature Processing
Like traditional data pipelines, feature processing operates in two complementary modes: batch and streaming. Both ingest raw user, item, and context data — but differ in freshness, computation cost, and purpose.
- Batch feature processing computes long-term, stable features from historical data and generates the complete, point-in-time-correct training dataset.
- Streaming feature processing computes real-time, session-level or short-term features (e.g., rolling windows) for online serving.
Together, these pipelines produce and continuously refresh model-ready features, writing them to the Offline Store and Online Store, respectively.
Feature Store
The Feature Store centralizes the definition, storage, and serving of features.
It uses a Feature Registry to enforce that the features used for offline model training have the exact same logic and semantics as those used for online inference. This consistency is vital for preventing training-serving skew, where differences in feature calculation lead to unexpected model behavior in production.
Once features are computed, the Feature Store functions as the system’s persistent, versioned layer — a unified interface for managing features, and under the hood it consists of an Offline Store for batch training data and an Online Store for real-time serving — both synchronized by shared feature definitions.
| Store | Purpose | Latency / Freshness | Storage | Examples |
| --- | --- | --- | --- | --- |
| Offline Store | Historical features for model training, validation, and batch inference. | Hours–Days | S3 / Delta / Redshift / BigQuery | “user’s lifetime purchase count,” “item’s category popularity over the last 90 days” |
| Online Store | Latest feature values for real-time inference. | Milliseconds–Minutes | Redis / DynamoDB / Cassandra | “item’s current trendiness score” |
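One simple way to see how shared definitions prevent training-serving skew is a generic sketch (not any specific feature-store API) where the feature's logic is defined once and invoked by both the batch and online paths:

```python
import math

# Single source of truth for the feature's logic; both the batch job that
# fills the offline store and the online serving path call this function.
def log_purchase_count(purchase_count: int) -> float:
    return math.log1p(purchase_count)

# Batch path: materialize historical values for training.
def materialize_offline(rows):
    return [{"user_id": r["user_id"],
             "log_purchase_count": log_purchase_count(r["purchase_count"])}
            for r in rows]

# Online path: compute the identical value at request time.
def fetch_online(user_row):
    return {"log_purchase_count": log_purchase_count(user_row["purchase_count"])}
```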
Vector Database
Vector databases are optimized for storing high-dimensional embeddings (e.g., the numerical representation of a product image) and performing efficient similarity search. However, they also introduce challenges around latency, real-time updates, and memory–disk trade-offs, along with the inherent limitations of approximate nearest neighbor (ANN) methods, such as stale embeddings, update cost, and memory footprint. Running ANN search over millions to billions of item embeddings—while maintaining freshness constraints—is a non-trivial engineering problem.
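As a small illustration of this workload (using FAISS purely as an example; the index type and sizes are assumptions):

```python
import numpy as np
import faiss

d = 64                                          # embedding dimensionality
item_embs = np.random.rand(100_000, d).astype("float32")
faiss.normalize_L2(item_embs)                   # cosine similarity via inner product

index = faiss.IndexFlatIP(d)                    # exact search; real systems use ANN
index.add(item_embs)                            # indexes (e.g., IVF/HNSW) to trade
                                                # accuracy for speed at scale
query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)           # top-10 most similar items
```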
Training Pipeline
Once features are computed and stored, the training pipeline transforms them into machine learning models. This pipeline operates primarily in offline mode, leveraging the Offline Feature Store and historical labeled data to produce reproducible, versioned models ready for online deployment.
The training pipeline’s main goals are:
- Construct a consistent, point-in-time-correct training dataset from offline feature data.
- Train, validate, and version models for different stages (recall, pre-ranking, ranking, re-ranking).
Dataset Preparation
Dataset preparation is the process of organizing pre-computed features and interaction labels into structured datasets for model training and evaluation. It involves splitting the data into training, validation, and test sets, constructing positive and negative samples per split, and joining each sample with relevant features while ensuring temporal and user-level consistency.
Data Splitting
Data Splitting involves dividing logged data—such as impressions, clicks, and purchases—into training, validation, and test sets using strategies like random splitting, user-level or item-level partitioning, and time-based segmentation. Among these, temporal splitting is the most realistic for production, as it preserves chronological integrity and prevents leakage from future events.
Each stage of the recommendation pipeline—recall, pre-ranking, ranking, re-ranking—may require distinct splitting logic based on the type of logs it consumes, and care must be taken to maintain point-in-time feature consistency and avoid common leakage pitfalls, as summarized in the table and the sketch that follow.
| Stage | Typical Splitting Method | Notes |
| --- | --- | --- |
| Recall | Time-based or user-based | Focus on matching seen vs. unseen items |
| Pre-Ranking / Ranking | Time-based | Maintain chronological order of exposures |
| Re-Ranking | Session-based or time-based | Evaluate future sessions or held-out sessions |
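A minimal temporal-split sketch (pandas; the file name and cutoff dates are illustrative):

```python
import pandas as pd

events = pd.read_parquet("events.parquet")  # hypothetical log table with a 'timestamp' column
events = events.sort_values("timestamp")

train_end = pd.Timestamp("2024-05-01")
valid_end = pd.Timestamp("2024-05-15")

train = events[events["timestamp"] < train_end]
valid = events[(events["timestamp"] >= train_end) & (events["timestamp"] < valid_end)]
test  = events[events["timestamp"] >= valid_end]
# Chronological order is preserved, so no future event leaks into training.
```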
Sampling
The way we construct positive and negative samples determines what signal the model learns — and how well it generalizes. In recommendation systems, retrieval (recall) and ranking models differ in how they define and sample training pairs.
- Retrieval Models
- Positive samples: (user, item) pairs that were exposed and clicked (or otherwise interacted with).
- Negative samples
- Random negatives (popularity-weighted): since each user interacts with only a tiny fraction of the entire item corpus, negatives are sampled from the global item pool with probability proportional to item popularity raised to the 0.75 power. This down-weights extremely popular items, preventing the model from overfitting to high-frequency content and ignoring long-tail items (see the sketch after this list).
- In-batch negatives: in a mini-batch of N positive (user, item) pairs, each user treats the other N−1 items as negatives without explicit negative sampling.
- Hard negatives: items that are retrieved via ANN search and filtered at the pre-ranking or ranking stage.
- Pre-ranking / Ranking models
- Positive samples: items that were exposed and resulted in engagement.
- Negative samples: items that were exposed but not engaged.
Ranking is typically multi-task, learning to score items by their likelihood of various engagement types — such as clicks, likes, saves, shares, or purchases. For each engagement type, positives and negatives are defined in the same exposed-versus-engaged manner.
- Re-ranking models
- For diversity-based re-ranking models (e.g., MMR, DPP, MGS), we typically do not require explicitly labeled samples — these methods operate deterministically or in an unsupervised fashion, optimizing list diversity based on similarity or coverage metrics.
- In contrast, Contextual Bandit models require labeled interaction data to learn from rewards.
- Positive sample / reward: a measurable success signal following exposure (e.g., a click, purchase, or long dwell time).
- Negative sample / penalty: the absence of a positive reward (e.g., no click or a quick skip).
The need for labeled samples depends on the model type.
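Here is a minimal numpy sketch of the popularity-weighted random negatives described above (the corpus size and interaction counts are made up):

```python
import numpy as np

rng = np.random.default_rng(42)

n_items = 50_000
interaction_counts = rng.integers(1, 10_000, size=n_items)  # hypothetical item popularity

# Sampling probability proportional to popularity^0.75 down-weights head
# items relative to sampling proportional to raw popularity counts.
probs = interaction_counts.astype(np.float64) ** 0.75
probs /= probs.sum()

positives = rng.integers(0, n_items, size=1024)           # stand-in positive items
negatives = rng.choice(n_items, size=(1024, 4), p=probs)  # 4 sampled negatives per positive
```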
Feature assembly
Feature assembly is the process of combining pre-computed features with the corresponding samples in the training, validation, or test datasets to create ML-ready examples. For each user–item interaction, relevant features—such as user demographics, item metadata, embeddings, and contextual signals—are retrieved from the feature store or other storage systems and form a complete feature vector ready for model training and evaluation.
Crucially, feature assembly must respect point-in-time correctness, ensuring that only information available up to the timestamp of the interaction is used, which prevents leakage of future data into the model.
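A common way to enforce point-in-time correctness is an as-of join. A minimal pandas sketch (the column names and values are illustrative):

```python
import pandas as pd

# Interaction samples (one row per labeled user-item event).
samples = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "timestamp": pd.to_datetime(["2024-05-01", "2024-05-10", "2024-05-07"]),
    "label": [1, 0, 1],
}).sort_values("timestamp")

# Daily snapshots of a user feature from the offline store.
features = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "timestamp": pd.to_datetime(["2024-04-30", "2024-05-09", "2024-05-06"]),
    "purchase_count_7d": [3, 5, 1],
}).sort_values("timestamp")

# For each sample, take the latest feature value at or before its timestamp,
# so no future information leaks into the training example.
training_rows = pd.merge_asof(
    samples, features, on="timestamp", by="user_id", direction="backward"
)
```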
Model Training
Modern recommender systems rely on distributed training frameworks such as TensorFlow, PyTorch DDP, Horovod, or Ray Train to efficiently parallelize computation across multiple GPUs or nodes. These platforms support both data-parallel and model-parallel strategies to handle billions of interactions and massive embedding tables.
For experimentation, hyperparameter tuning and model search are orchestrated through tools like Ray Tune, SageMaker, or Vertex AI, enabling automated sweeps, early stopping, and resource-efficient optimization.
Training workflows for different stages — recall, pre-ranking, ranking, and re-ranking — are coordinated by Airflow or Kubeflow Pipelines, enabling multi-stage retraining, model versioning, and scheduled production updates.
Together, these systems form a scalable, reproducible, and fault-tolerant training environment that accelerates iteration from prototype to production-grade recommender models.
The following table summarizes the models used in the modern multi-stage recommendation systems, along with the infrastructure requirements. For readers interested in the modeling side, see my follow-up post.
| Stage | Common Models | Infrastructure |
| --- | --- | --- |
| Recall | Two-Tower (DSSM, YouTube DNN), LightGCN, PinSage | For very large catalogs and user bases (billions of interactions plus huge embedding tables), requires distributed GPU training (TensorFlow / PyTorch DDP / Horovod), large parameter servers or sharded embedding storage, and efficient data pipelines. |
| Pre-Ranking | Wide & Deep, DeepFM, shallow DNN | Can run on a single GPU or small clusters; less demanding distributed setup. |
| Ranking | DCN, DIN/DIEN, Transformer4Rec, MMoE, PLE | Needs multi-GPU distributed training, mixed precision, and automated hyperparameter tuning (Ray Tune, SageMaker). |
| Re-Ranking | LambdaMART, Contextual Bandit, GraphRL, DPP/MMR | Often CPU/GPU hybrid; may require custom trainers for sequence or policy learning; frequently retrained online. |
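To ground the recall row, here is a compact two-tower sketch in PyTorch; the dimensions, temperature, and in-batch softmax loss are illustrative assumptions rather than any specific production design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """Maps a sparse ID through an embedding and a small MLP to a unit vector."""
    def __init__(self, vocab_size: int, embed_dim: int = 64, out_dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, out_dim)
        )

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.mlp(self.embed(ids)), dim=-1)

user_tower = Tower(vocab_size=100_000)
item_tower = Tower(vocab_size=1_000_000)

# One training step with in-batch negatives: each user's positive item is
# scored against every other item in the batch.
user_ids = torch.randint(0, 100_000, (256,))
item_ids = torch.randint(0, 1_000_000, (256,))

u = user_tower(user_ids)          # (256, 32)
v = item_tower(item_ids)          # (256, 32)
logits = u @ v.T / 0.05           # temperature-scaled similarity matrix
labels = torch.arange(256)        # diagonal pairs are the positives
loss = F.cross_entropy(logits, labels)
loss.backward()
```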
Recommendation systems must also handle challenges such as cold-start for new users or items, context-aware recommendations that depend on factors like time, device, or location, and feedback loops where user interactions (such as clicks) shape future behavior. Modern architectures increasingly rely on sequential and contextual models, often powered by streaming data.
Model Evaluation
Evaluation ensures that trained models not only fit historical data but also generalize effectively to real user behavior. It operates at two levels: offline validation and online experimentation.
- Offline evaluation uses historical logs to measure metrics such as hit rate, MRR, NDCG, precision@K, recall@K, and coverage, verifying that the model ranks relevant items higher than irrelevant ones.
- Modern recommender systems place significant emphasis on online metrics such as click-through, conversion, and engagement; on causal measurement through A/B testing; and on monitoring bias, fairness, and drift. The goal is to ensure reliable, high-quality recommendations that align with business objectives, while meeting strict latency and freshness requirements.
Together, these stages close the feedback loop — offline metrics provide fast iteration and sanity checks, while online experiments validate true business impact and user experience quality before full rollout.
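For concreteness, a minimal recall@K and NDCG@K computation over one user's ranked list (the ranked IDs and clicks are made up):

```python
import numpy as np

def recall_at_k(ranked_items, relevant_items, k):
    hits = len(set(ranked_items[:k]) & set(relevant_items))
    return hits / len(relevant_items)

def ndcg_at_k(ranked_items, relevant_items, k):
    relevant = set(relevant_items)
    dcg = sum(1.0 / np.log2(i + 2)                      # position-discounted gain
              for i, item in enumerate(ranked_items[:k]) if item in relevant)
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal

ranked = ["i9", "i3", "i7", "i1", "i5"]   # model's top-5 for one user
clicked = ["i3", "i5"]                     # ground-truth positives from logs
print(recall_at_k(ranked, clicked, 5))     # 1.0
print(ndcg_at_k(ranked, clicked, 5))       # ~0.62
```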
Deployment Pipeline
The deployment pipeline automates the process of promoting offline validated models from the registry into production serving environments.
The deployment pipeline is often triggered after model training and evaluation, either manually or automatically through:
- Orchestration tools: Airflow / Kubeflow Pipelines for full DAG automation
- CI/CD tools: GitHub Actions, Jenkins, GitLab CI
- MLOps platforms: SageMaker Pipelines, Vertex AI, Azure ML
Model packaging & registration
Once a model passes offline evaluation, it is packaged and registered for deployment.
Model Packaging (Preparation)
Model packaging is the process of creating a single, portable artifact (such as ONNX, TorchScript, or TensorFlow SavedModel) that bundles the trained model weights, architecture, preprocessing logic (including feature schema and fitted scalers/encoders), and version metadata. This ensures the model is reliably reproducible and deployable across various serving environments.
Model Registration (Cataloging)
Model Registration transforms a simple code artifact into a governed, deployable asset. It stores the packaged model in a central, versioned repository—the Model Registry (such as MLflow, SageMaker Model Registry, or Vertex AI Model Registry). The registry tracks model version history, approval status (e.g., whether a model is ready for production), model metadata (metrics, lineage, feature definitions), and environment details. Registration provides a single source of truth for promotion and rollback, enabling automated CI/CD pipelines to safely deploy models to production and ensuring full traceability across the ML lifecycle.
Model Deployment & Serving
Once a model is packaged and registered, the pipeline retrieves the approved version, validates compatibility with the target environment, and deploys it to online or batch serving systems.
Choose serving infrastructure based on inference type:
- Batch inference: Spark, Airflow, SageMaker Batch are used for offline or large-scale predictions where latency is not critical.
- Real-time inference: KServe, SageMaker Endpoints, TensorFlow Serving, TorchServe, FastAPI, Triton Inference Server are used for online predictions where low latency is required.
- Edge deployment: ONNX, TensorRT, CoreML are used for on-device inference, often with hardware optimization.
Deployment decisions also account for critical factors such as model versioning, online versus offline training, real-time feature updates, retrieval and ranking separation, caching, load balancing, and observability, ensuring safe, efficient, and reliable operation in production.
To safely release new models, production traffic is rolled out using strategies such as canary deployments, blue-green releases, or A/B testing.
Online Inference
Online inference runs the model on a live request to generate personalized recommendations.
- When a request arrives, the system retrieves fresh user, item, and contextual features from the online feature store (e.g., Redis, DynamoDB, or Feast) and in parallel fetches embeddings and cached candidates from in-memory caches or vector databases and feeds them into the serving model (e.g., Triton Inference Server, TensorFlow Serving, or TorchServe).
- The model computes engagement probabilities or relevance scores within strict latency budgets — typically tens of milliseconds — and returns ranked candidates to the application layer.
Online inference infrastructure is optimized for low latency, high throughput, and feature consistency, enabling real-time personalization at scale.
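A highly simplified serving endpoint (FastAPI is used only as an illustration; the feature-store, retriever, and ranker clients are hypothetical stubs standing in for Redis/Feast, a vector database, and a Triton/TorchServe deployment):

```python
from fastapi import FastAPI

app = FastAPI()

# Hypothetical stand-ins for real clients; replace with actual integrations.
class StubFeatureStore:
    def get_user_features(self, user_id: str) -> dict:
        return {"user_id": user_id, "embedding": [0.1] * 8}

class StubRetriever:
    def retrieve(self, embedding, top_n: int):
        return [f"item_{i}" for i in range(top_n)]

class StubRanker:
    def predict(self, user_features, candidates):
        return [1.0 / (i + 1) for i, _ in enumerate(candidates)]

feature_store, retriever, ranker = StubFeatureStore(), StubRetriever(), StubRanker()

@app.get("/recommend/{user_id}")
def recommend(user_id: str, k: int = 20):
    feats = feature_store.get_user_features(user_id)          # online store lookup
    candidates = retriever.retrieve(feats["embedding"], 500)  # candidate generation
    scores = ranker.predict(feats, candidates)                # ranking model scores
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])[:k]
    return {"user_id": user_id, "items": [item for item, _ in ranked]}
```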
Monitoring & observability
Monitoring and observability ensure that deployed recommendation models continue to perform reliably under real-world conditions.
Once in production, the system continuously tracks model health, data integrity, and serving performance across multiple layers — including QPS, p50/p95/p99 latency, throughput, error rates, and prediction distributions.
Feature and label drift detection mechanisms compare online inputs against historical training data to catch schema changes and distribution shifts early, while metric tracking (e.g., CTR, CVR, AUC degradation) provides ongoing validation of model quality.
Logs and traces from serving systems are collected via Prometheus, Grafana, OpenTelemetry, or EvidentlyAI to enable real-time dashboards and alerting.
These signals feed into automated retraining or rollback pipelines, ensuring the recommendation system remains accurate, responsive, and resilient as data and user behavior evolve.
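As one concrete drift check, here is a minimal sketch of a population stability index (PSI) comparison; the 0.2 alert threshold is a common convention, not a universal standard:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between training and live feature samples."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf            # catch out-of-range values
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)             # avoid log(0) on empty bins
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

train_sample = np.random.normal(0, 1, 10_000)    # feature values at training time
live_sample = np.random.normal(0.3, 1, 10_000)   # shifted live distribution
if psi(train_sample, live_sample) > 0.2:          # common alerting convention
    print("significant drift: trigger retraining or rollback review")
```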
Final Thoughts
Modern recommendation systems have evolved into complex, end-to-end machine learning ecosystems, where success depends as much on infrastructure and data reliability as on modeling innovation. As models grow more complex and data pipelines more dynamic, the ability to reason across stages and automate intelligently becomes a competitive advantage. Ultimately, the goal is to build systems that learn, adapt, evolve, and deliver value continuously.