Modern RecSys Multi-Stage Pipeline
Major tech companies such as Google, Meta, and Amazon typically employ a multi-stage pipeline as the standard architecture for large-scale, low-latency recommendation systems. This design efficiently filters a catalog containing millions or even billions of items down to a final set of results that are both highly relevant and diverse.
Large language models (LLMs) are increasingly augmenting these systems—for example, in candidate generation, re-ranking, explanation generation, and constraint handling—but they typically layer onto the classic retrieval-and-ranking funnel rather than replace it.
This post gives an overview of the multi-stage pipeline, and the models used at each stage, as it is widely deployed in production today.
Overview
The pipeline is typically composed of three to four distinct stages:
- Retrieval (召回): retrieves a small set of relevant candidates from the entire catalog of millions to billions of items. This stage prioritizes high recall (finding all potentially relevant items) and speed.
- Pre-ranking (粗排): an optional but common intermediate stage that filters candidates before the most computationally expensive stage (ranking). It applies heuristics, business rules, or shallow models to estimate rough relevance scores.
- Ranking (精排): uses deep models (e.g., DeepFM, DIN, Transformers4Rec) with the richest set of features to score each candidate accurately and maximize precision. Here, multi-objective ranking balances multiple business goals such as CTR, CVR, watch time, and retention.
- Re-ranking (重排): refines the ranked list after scoring, often for soft constraints or optimization goals such as diversity and freshness.
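The funnel shape of these stages can be illustrated with a minimal sketch. The stage budgets (500, 50, 10) and the arithmetic scoring functions are placeholders chosen for illustration, not values from any production system; in practice each stage would call a progressively more expensive model, and a re-ranking step would reorder the final list without shrinking it.

```python
from typing import Callable, List

Stage = Callable[[List[int]], List[int]]

def make_stage(score: Callable[[int], float], keep: int) -> Stage:
    """Build a stage that keeps the `keep` highest-scoring candidates."""
    def stage(candidates: List[int]) -> List[int]:
        return sorted(candidates, key=score, reverse=True)[:keep]
    return stage

def run_funnel(catalog: List[int], stages: List[Stage]) -> List[int]:
    """Pass candidates through each stage in order, shrinking the set."""
    candidates = catalog
    for stage in stages:
        candidates = stage(candidates)
    return candidates

# Stand-in scores (assumptions): real stages would call models of rising cost.
catalog = list(range(100_000))
stages = [
    make_stage(lambda i: (i * 37) % 1009, keep=500),  # retrieval: cheap, recall-oriented
    make_stage(lambda i: (i * 17) % 211, keep=50),    # pre-ranking: lightweight model
    make_stage(lambda i: (i * 7) % 53, keep=10),      # ranking: heaviest model
]
final = run_funnel(catalog, stages)
print(len(final))
```

The point of the structure is cost control: the expensive scorer only ever sees the small set that survived the cheaper stages.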
Retrieval (召回): Candidate Generation
The Retrieval stage’s job is to efficiently narrow billions of items to a few hundred relevant candidates for downstream ranking. This stage has transitioned from classical heuristics to scalable neural architectures.
Traditional Methods estimate user–item relevance (or preference) based on similarity.
- Collaborative Filtering (CF): Based on user-item interaction matrices, using techniques like matrix factorization (e.g., trained with ALS or a BPR objective).
- Content-Based Filtering (CBF): Leveraging item metadata (e.g., genre, category, tags) to match user profiles.
These methods are interpretable and effective in small or moderately sized systems, but they struggle with scalability, sparsity, and cold-start challenges, which motivates modern deep retrieval architectures.
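As a concrete illustration of similarity-based collaborative filtering, here is a minimal item-based CF sketch using cosine similarity over binary interaction vectors. The toy `interactions` data and the function names are made up for this example; it is a variant of classical Item-CF, not a reproduction of any specific production system.

```python
import math
from collections import defaultdict

# Toy implicit-feedback data (hypothetical): user -> set of items interacted with.
interactions = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b"},
    "u3": {"b", "c", "d"},
    "u4": {"d"},
}

def item_cosine(interactions):
    """Cosine similarity between items over binary user-interaction vectors."""
    item_users = defaultdict(set)
    for user, items in interactions.items():
        for item in items:
            item_users[item].add(user)
    sims = defaultdict(dict)
    for i, users_i in item_users.items():
        for j, users_j in item_users.items():
            if i == j:
                continue
            overlap = len(users_i & users_j)
            if overlap:
                sims[i][j] = overlap / math.sqrt(len(users_i) * len(users_j))
    return sims

def recommend(user, interactions, sims, k=2):
    """Score unseen items by summed similarity to the user's history."""
    seen = interactions[user]
    scores = defaultdict(float)
    for i in seen:
        for j, s in sims.get(i, {}).items():
            if j not in seen:
                scores[j] += s
    # Sort by descending score, breaking ties by item id for determinism.
    return sorted(scores, key=lambda j: (-scores[j], j))[:k]

sims = item_cosine(interactions)
print(recommend("u2", interactions, sims))
```

The all-pairs loop is quadratic in the number of items, which hints at the scalability limits mentioned above.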
Deep Retrieval Models combine collaborative signals, content signals, and even sequential modeling through learned embeddings.
- Two-Tower (Dual Encoder) Models: Users and items are encoded separately into a shared latent space using architectures like DSSM and YouTube DNN. Training typically uses contrastive loss or sampled softmax.
- Graph-Based Models: Models such as LightGCN and PinSage use the user–item interaction graph to propagate signals across multiple hops. LightGCN learns both user and item embeddings, whereas PinSage focuses on item embeddings. In PinSage, user representations can be constructed implicitly by aggregating the embeddings of items a user has interacted with.
These models explicitly produce embeddings for users and items, making it easy to perform vector-based retrieval via approximate nearest neighbor (ANN) search with tools such as FAISS, ScaNN, or Milvus. They also scale well: item embeddings can be precomputed offline, and user embeddings can be refreshed periodically.
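The retrieval step over precomputed embeddings can be sketched as a top-k inner-product search. The version below is an exact brute-force search in plain Python, used here only as a stand-in for an ANN index; the embeddings and item IDs are hypothetical values, not outputs of a trained two-tower model.

```python
import heapq

# Hypothetical precomputed item-tower outputs (item_id -> embedding).
item_embeddings = {
    "item_1": [0.9, 0.1, 0.0],
    "item_2": [0.1, 0.8, 0.1],
    "item_3": [0.7, 0.2, 0.1],
    "item_4": [0.0, 0.1, 0.9],
}

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def retrieve_top_k(user_embedding, item_embeddings, k):
    """Exact top-k by inner product; an ANN index approximates this at scale."""
    scored = ((dot(user_embedding, emb), item_id)
              for item_id, emb in item_embeddings.items())
    return [item_id for _, item_id in heapq.nlargest(k, scored)]

user_embedding = [1.0, 0.2, 0.0]  # hypothetical user-tower output
print(retrieve_top_k(user_embedding, item_embeddings, k=2))
```

Because the user and item towers only meet at the final dot product, the item side can be indexed ahead of time and only the user vector is needed at request time.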
Multi-Channel Retrieval has become the industry-standard approach for the retrieval stage. Instead of relying on a single retrieval model, modern systems employ multiple independent candidate generators—or “channels”—each designed to capture different aspects of user intent and item characteristics. This strategy ensures that the downstream ranking model receives a rich yet manageable pool of high-quality candidates that represent multiple user intents and contextual signals.
For example, deep matching channels (e.g., Two-Tower, LightGCN, PinSage) model high-level semantic relationships and generalize well across domains; content-based channels retrieve cold-start or new items using multimodal embeddings derived from text, image, or metadata; popularity-based channels ensure baseline engagement and user satisfaction by including trending or category-level popular items; and collaborative filtering channels (e.g., Matrix Factorization, Item-CF) model classical user–item similarity patterns.
Each channel typically produces its own candidate subset. These subsets are then merged, deduplicated, and forwarded to the pre-ranking or ranking stage for fine-grained scoring.
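A minimal sketch of the merge-and-deduplicate step might look like the following. The channel names, per-channel quotas, and the "first channel wins" attribution rule are assumptions made for illustration; real systems often use weighted or interleaved merging instead.

```python
def merge_channels(channel_results, quotas, limit):
    """Take up to `quota` items per channel, dedupe, and cap the merged pool.

    Returns (item_id, source_channel) pairs; an item found by several
    channels is attributed to the first channel that produced it.
    """
    merged = {}  # dicts preserve insertion order
    for channel, candidates in channel_results.items():
        for item_id in candidates[: quotas.get(channel, 0)]:
            merged.setdefault(item_id, channel)
    return list(merged.items())[:limit]

# Hypothetical per-channel outputs, each already sorted by channel score.
channel_results = {
    "two_tower": ["a", "b", "c", "d"],
    "item_cf": ["b", "e", "f"],
    "popular": ["a", "g"],
}
quotas = {"two_tower": 3, "item_cf": 2, "popular": 2}

pool = merge_channels(channel_results, quotas, limit=10)
print(pool)
```

Keeping the source channel alongside each item is useful later for debugging and for measuring how much each channel contributes.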
Pre-ranking (粗排): Filtering
The Pre-Ranking Stage serves as the bridge between large-scale retrieval and deep ranking.
It receives a few hundred candidates from the Retrieval stage and quickly filters them down to a few dozen high-potential items.
Before any model inference, the pre-ranking step typically applies lightweight heuristics or business rules to clean and constrain the candidate pool by 1) removing out-of-stock or already-seen items, 2) enforcing age, region, or policy restrictions, and 3) applying simple score thresholds or rule-based filters for compliance and quality control.
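The rule-based cleaning described above can be sketched as a single predicate applied to each candidate. The `Candidate` fields, the threshold value, and the sample data are hypothetical and stand in for whatever metadata a real catalog exposes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    item_id: str
    in_stock: bool
    min_age: int
    regions: frozenset
    retrieval_score: float

def passes_rules(c, user_age, user_region, seen_ids, min_score=0.1):
    """Return True only if the candidate clears every hard constraint."""
    return (
        c.in_stock
        and c.item_id not in seen_ids
        and user_age >= c.min_age
        and user_region in c.regions
        and c.retrieval_score >= min_score
    )

candidates = [
    Candidate("a", True, 0, frozenset({"US", "CA"}), 0.80),
    Candidate("b", False, 0, frozenset({"US"}), 0.75),  # out of stock
    Candidate("c", True, 18, frozenset({"US"}), 0.60),  # age-restricted
    Candidate("d", True, 0, frozenset({"US"}), 0.05),   # below score threshold
    Candidate("e", True, 0, frozenset({"US"}), 0.40),   # already seen
    Candidate("f", True, 0, frozenset({"DE"}), 0.90),   # not available in region
]
kept = [c.item_id for c in candidates
        if passes_rules(c, user_age=16, user_region="US", seen_ids={"e"})]
print(kept)
```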
Then, a lightweight scoring model estimates rough relevance scores. Common choices include:
- Gradient Boosted Trees (e.g., GBDT, XGBoost)
- Shallow Neural Networks: Small multi-layer perceptrons (MLPs)
- Two-tower or Three-Tower Models (user–item–context)
The pre-ranking model is usually trained as a binary classification task with log loss (binary cross-entropy), sometimes combined with pairwise ranking losses to improve relative ranking quality.
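The two kinds of loss mentioned here can be written out in a few lines. This is a sketch for intuition only: the pairwise loss shown is the logistic (RankNet-style) form, which is one common choice among several, and the function names are illustrative.

```python
import math

def log_loss(labels, probs, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

def pairwise_logistic_loss(pos_scores, neg_scores):
    """Mean of log(1 + exp(-(s_pos - s_neg))) over all positive/negative pairs."""
    losses = [math.log1p(math.exp(-(sp - sn)))
              for sp in pos_scores for sn in neg_scores]
    return sum(losses) / len(losses)

print(log_loss([1, 0, 1], [0.9, 0.2, 0.6]))
print(pairwise_logistic_loss([2.0, 1.5], [0.5, -1.0]))
```

Log loss judges each prediction on its own, while the pairwise loss only cares whether clicked items score above non-clicked ones, which is closer to what a filtering stage needs.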
Ranking (精排): Fine-Grained Scoring
The Ranking Stage is the core decision layer of a recommender system.
It takes the top candidates (usually dozens) from pre-ranking and performs fine-grained, personalized scoring using the richest set of user, item, and context features available.
The ranking model predicts engagement signals, such as click-through rate (CTR), conversion rate (CVR), or expected watch time, and orders items accordingly.
To provide highly personalized, accurate, and context-aware item ordering, advanced deep learning architectures are widely adopted to capture complex user-item interactions, sequential behaviors, and multiple prediction objectives.
Several families of models dominate modern production systems:
- Cross-Feature Models
These models explicitly learn both low- and high-order feature interactions. They combine efficiency with the ability to model complex feature interactions directly.
Example: Wide & Deep, DCN, DCN v2, DeepFM.
- Attention & Sequence Models
These models capture evolving user interests by applying attention mechanisms over recent user behavior sequences to dynamically match candidate items. They are widely adopted for e-commerce and feed ranking (e.g., Taobao, TikTok).
Examples: Deep Interest Network (DIN), Deep Interest Evolution Network (DIEN), Deep Session Interest Network (DSIN).
- Multi-Task Learning Models
These models jointly optimize multiple related objectives such as CTR, CVR, and retention by sharing representations while allowing task-specific adaptations.
In the basic design, a shared bottom learns a general representation of the user-item relationship, and its output is fed into multiple task-specific prediction heads (or top MLPs), one for each objective.
The model is trained by minimizing a weighted sum of the loss functions from all the prediction heads, for example $\mathcal{L} = \sum_{k} w_k \mathcal{L}_k$, where $\mathcal{L}_k$ is the loss of task $k$ (such as CTR or CVR) and $w_k$ is its weight.
Examples: Multi-gate Mixture-of-Experts (MMoE), Entire Space Multi-Task Model (ESMM), Progressive Layered Extraction (PLE).
- Transformer-Based Models
These models leverage self-attention mechanisms to capture long-term sequential dependencies and fuse multimodal content embeddings (text, video, images). They are increasingly used for session-based recommendations and multimodal fusion to enhance personalization and content understanding.
Examples: Transformers4Rec, P5, UniRec.
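To make the attention idea from the attention and sequence models above more concrete, here is a simplified sketch of target-aware attention pooling: each past behavior is weighted by its relevance to the candidate item. It uses a plain dot product with softmax normalization as an assumption for brevity, so it is not the exact activation unit used in DIN, and the embeddings are made-up values.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def target_attention_pool(behavior_embs, candidate_emb):
    """Weight each past-behavior embedding by its relevance to the candidate."""
    weights = softmax([dot(b, candidate_emb) for b in behavior_embs])
    dim = len(candidate_emb)
    pooled = [sum(w * b[d] for w, b in zip(weights, behavior_embs))
              for d in range(dim)]
    return weights, pooled

# Hypothetical 2-d embeddings: two "sports" behaviors and one "cooking" behavior.
behaviors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
candidate = [2.0, 0.0]  # a "sports" candidate
weights, user_interest = target_attention_pool(behaviors, candidate)
print(weights, user_interest)
```

The resulting user representation changes with the candidate being scored, which is what distinguishes this family from a fixed average of the behavior history.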
These architectures are trained on large-scale feedback data using cross-entropy and multi-task losses.
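The shared-bottom structure and the weighted sum of task losses described for multi-task models can be sketched as a tiny forward pass. All parameters below are fixed, hypothetical numbers chosen for illustration (a real model would learn them by gradient descent), and the two tasks, `ctr` and `cvr`, are examples only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def shared_bottom(x, W, b):
    """One ReLU layer whose output is shared by all tasks."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def task_head(h, w, b):
    """Task-specific logistic head on top of the shared representation."""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)

def bce(y, p, eps=1e-12):
    """Binary cross-entropy for a single example."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# Hypothetical fixed parameters and task weights.
W_shared = [[0.5, -0.2], [0.1, 0.3]]
b_shared = [0.0, 0.0]
heads = {"ctr": ([1.0, -1.0], 0.0), "cvr": ([0.5, 0.5], -1.0)}
task_weights = {"ctr": 1.0, "cvr": 0.5}

def multi_task_loss(x, labels):
    """Return per-task predictions and the weighted sum of per-task losses."""
    h = shared_bottom(x, W_shared, b_shared)
    preds = {t: task_head(h, w, b) for t, (w, b) in heads.items()}
    loss = sum(task_weights[t] * bce(labels[t], preds[t]) for t in heads)
    return preds, loss

preds, loss = multi_task_loss([1.0, 2.0], {"ctr": 1, "cvr": 0})
print(preds, loss)
```

The task weights are a tuning knob: raising the weight of one objective pulls the shared representation toward that task, sometimes at the expense of the others.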
The evolution of ranking models mirrors the broader shift in AI: from hand-crafted interactions → learned interactions → sequence and graph modeling → multimodal and generative reasoning. Today’s ranking systems blend deep representation learning, attention, and multi-task optimization to deliver both accuracy and personalization at scale.
Re-Ranking (重排): Diversity, Exploration, and Control
This final stage adjusts the ranked list to achieve a better balance between relevance, diversity, freshness, and business constraints before it is displayed to users.
- Diversity: Avoid recommending too many similar items consecutively (e.g., not all action movies), promoting variety across genres, categories, or formats.
- Fairness / Exposure Control: Ensure balanced visibility for different creators, sellers, or item categories, and enforce policy constraints like exposure quotas or repetition limits.
- Exploration: Introduce long-tail or new items to gather user feedback and improve future recommendations.
- Freshness Boost: Temporarily prioritize recently uploaded or trending items to counteract popularity bias.
- Session Optimization / Satisfaction: Maximize overall user engagement, not just per-item CTR.
Re-Ranking methods can be broadly grouped by their level of learning sophistication and optimization scope.
- Rule-Based constraints: Applies deterministic filtering and quota rules to enforce business or policy constraints (e.g., category frequency, exposure limits). They are widely used in production for simplicity, stability, and interpretability.
- Linear or MLP re-scoring: Utilizes simple learning models (such as linear models or small MLPs) that take the original ranking scores plus contextual or novelty features to re-score items considering position bias, freshness, or saturation effects. This is a core industry method that balances efficiency and effectiveness.
- Contextual Bandits exploration: Balances Exploitation (sticking to the best-known options with high predicted CTR) and Exploration (trying out uncertain or under-exposed items to improve long-term learning), using algorithms such as LinUCB, Thompson Sampling, EXP3, and Neural Bandits. Widely deployed in feeds and ads systems, and used by YouTube, TikTok, and Taobao for adaptive content rotation and freshness.
- Diversity-Aware post-processing: Enforces diversity constraints to avoid redundant or overly similar content in the final list. Common models include Maximal Marginal Relevance (MMR), Determinantal Point Process (DPP), and Modified Gram-Schmidt (MGS). Commonly used as a fast post-processing step for feeds and blended content sources.
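As an example of diversity-aware post-processing, the following is a minimal greedy MMR sketch. The relevance scores, the genre-prefix similarity function, and the trade-off value `lam=0.7` are assumptions for illustration; production systems would typically use embedding similarity instead.

```python
def mmr_rerank(relevance, similarity, k, lam=0.7):
    """Greedy MMR: repeatedly pick the item maximizing
    lam * relevance - (1 - lam) * (max similarity to already selected items)."""
    selected = []
    remaining = set(relevance)
    while remaining and len(selected) < k:
        def mmr_score(item):
            redundancy = max((similarity(item, s) for s in selected), default=0.0)
            return lam * relevance[item] - (1 - lam) * redundancy
        # Iterate in sorted order so ties break deterministically.
        best = max(sorted(remaining), key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical ranking scores; items sharing a genre prefix count as similar.
relevance = {"action_1": 0.90, "action_2": 0.85, "comedy_1": 0.60}

def same_genre(a, b):
    return 1.0 if a.split("_")[0] == b.split("_")[0] else 0.0

print(mmr_rerank(relevance, same_genre, k=3))
```

With these inputs the second action movie is pushed below the comedy even though its relevance score is higher, which is the intended diversity effect.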
In addition, advanced methods that reason about future states, inter-item dependencies, and cumulative reward, such as listwise learning (e.g., LambdaMART, ListNet), sequential RL (e.g., DQN-Rank, SlateQ), and GraphRL, are used only in limited or hybrid deployments because of their high computational complexity and serving cost.
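The exploration and exploitation trade-off from the bandit bullet above can be illustrated with a small simulation. Note the simplification: this is plain (non-contextual) Beta-Bernoulli Thompson Sampling, not a contextual bandit such as LinUCB, and the item names and "true" CTR values are invented purely to drive the simulation.

```python
import random

def thompson_sampling(true_ctr, rounds, seed=0):
    """Simulate Beta-Bernoulli Thompson Sampling and return pulls per item."""
    rng = random.Random(seed)
    alpha = {a: 1.0 for a in true_ctr}  # 1 + observed clicks
    beta = {a: 1.0 for a in true_ctr}   # 1 + observed non-clicks
    pulls = {a: 0 for a in true_ctr}
    for _ in range(rounds):
        # Sample a plausible CTR for each item from its posterior.
        sampled = {a: rng.betavariate(alpha[a], beta[a]) for a in true_ctr}
        arm = max(sampled, key=sampled.get)
        # Simulated user feedback (assumption: clicks are independent Bernoulli draws).
        if rng.random() < true_ctr[arm]:
            alpha[arm] += 1.0
        else:
            beta[arm] += 1.0
        pulls[arm] += 1
    return pulls

hypothetical_ctr = {"new_item": 0.30, "tail_item": 0.10, "stale_item": 0.05}
pulls = thompson_sampling(hypothetical_ctr, rounds=2000)
print(pulls)
```

Early rounds spread exposure across all items because their posteriors are wide; as evidence accumulates, most exposure shifts to the item with the best observed click rate.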
Supporting Infrastructure
A modern RecSys pipeline relies heavily on high-performance infrastructure to handle vast data volumes, complex models, and serve recommendations in real time. This infrastructure includes scalable data storage, efficient data processing pipelines, low-latency feature computation and serving layers, distributed training clusters, and real-time serving systems capable of delivering millions of personalized recommendations per second.
- Data Ingestion & Logging: This layer captures user interactions (clicks, views, purchases), item updates, and contextual signals in real time. Tools like Kafka, Flink and Spark are used to stream and process these events efficiently for downstream consumption.
- Feature Store: A centralized, low-latency service that stores and serves features used in both training and inference. It supports batch-computed features (e.g., 7-day click counts) and real-time features (e.g., current session activity), ensuring consistency across environments.
- Model Training Pipeline: This component handles offline training of retrieval, ranking, and re-ranking models using frameworks like TensorFlow or PyTorch. It includes distributed training and experiment tracking for reproducibility and performance tuning.
- Serving Infrastructure: Handles real-time inference and delivery of recommendations with low latency. It includes model servers, caching layers (e.g., Redis), and optimizations like batching and quantization to meet production SLAs.
- Monitoring & Evaluation: Tracks metrics such as CTR, watch time, and retention, and supports A/B testing to validate model changes and new features across all stages of the pipeline on live traffic. It also includes alerting systems to detect anomalies or regressions in real time.
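To make the feature store's batch-feature example concrete, here is a minimal sketch of computing 7-day click counts from an event log. The event tuple layout `(timestamp, item_id, event_type)` and the sample events are assumptions for illustration; the idea is that the same function definition is used for both training data and serving so the two stay consistent.

```python
from collections import Counter
from datetime import datetime, timedelta

def click_count_7d(events, as_of):
    """Count clicks per item in the 7 days up to and including `as_of`."""
    cutoff = as_of - timedelta(days=7)
    return Counter(
        item_id
        for ts, item_id, kind in events
        if kind == "click" and cutoff < ts <= as_of
    )

# Hypothetical event log: (timestamp, item_id, event_type).
events = [
    (datetime(2024, 1, 1, 12), "item_1", "click"),  # older than 7 days
    (datetime(2024, 1, 5, 9), "item_1", "click"),
    (datetime(2024, 1, 8, 18), "item_1", "click"),
    (datetime(2024, 1, 8, 19), "item_2", "view"),   # not a click
    (datetime(2024, 1, 9, 8), "item_2", "click"),
]
features = click_count_7d(events, as_of=datetime(2024, 1, 10))
print(features)
```

Passing `as_of` explicitly matters for training: features should be computed as of the time of each logged impression, not as of today, to avoid leaking future information.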
Final Thoughts
Modern recommendation systems have evolved from manual feature engineering to deep, multi-objective, and generative architectures. The next generation of recommender systems will be:
- Unified: integrating retrieval, ranking, and generation under a single LLM framework.
- Multimodal: understanding text, image, video, and audio jointly.
- Generative: creating personalized items, not just recommending existing ones.
- Privacy-aware: combining federated and on-device learning with LLM reasoning.
From retrieval to re-ranking, from personalization to privacy, they are now, and will continue to be, the intelligent backbone of every digital experience.
- Author:Fan Luo
- URL:https://fanluo.me/article/modeling-for-modern-recommendation-systems
- Copyright:All articles in this blog adopt BY-NC-SA agreement. Please indicate the source!
