Agentic search engines such as Google AI Mode, Perplexity, Bing Copilot, and ChatGPT Search no longer mean “type keywords, get ten blue links.” These AI search experiences can understand tasks, plan queries, call tools, synthesize results, and deliver a conversational response with inline citations, minimizing user effort. The field has evolved into a layered pipeline combining multi-index retrieval, reranking, multimodal signals, LLMs, citations, personalization, and, increasingly, autonomous agents that plan and execute multi-step tasks.
In this post, I’ll walk through the stack from bottom to top: how these systems crawl and index pages, how they retrieve and rank information, and how recent features like RAG and agentic search build on these foundations.
Search Engine Basics: Crawling, Indexing, Serving
Before diving into Agentic search, I’ll briefly review the basics of how a search engine operates: crawling the web to discover pages, indexing them for efficient lookup, and serving results when a user submits a query. “Serving” refers to all the work done at query time—and this is exactly the area that has been transformed most dramatically by LLMs. This post will focus on that query-time layer.
Crawling discovers what exists on the web
There is no global registry of URLs, so crawlers such as Googlebot continuously traverse the web. They discover new URLs in several ways: by following links on pages that have already been crawled, by parsing hub or category pages that link to new content, and by reading sitemaps or URL lists that site owners submit. Once a URL is known, the crawler may fetch it; crawling decisions are made algorithmically, taking into account how frequently a site’s content changes, how many URLs it exposes, how responsive the server is, and, importantly, constraints like robots.txt and login walls. During crawling, search engines render pages in a headless, Chrome-like environment and execute JavaScript, because much modern content is injected dynamically. Modern search infrastructure runs many specialized crawlers optimized for different content types and sources, and maintains multiple indexes beyond HTML web pages.
In addition to open-web crawling, search systems also ingest data through partnerships and structured feeds: public transit agencies supply schedules, data providers publish structured datasets, publishers and merchants push product catalogs, and libraries and digitization projects contribute scanned books and metadata, extending coverage far beyond what open-web crawling alone can reach.
Indexing organizes the web's information
Indexing begins after a page has been crawled and rendered. The search system then parses the content—text, titles, headings, alt attributes, images, videos, and structured data—to determine what the page is about, which language it uses, whether it targets a particular region, and how usable it is. A key part of indexing is canonicalization. Because many URLs can lead to the same or nearly identical content (www vs. non-www, tracking parameters, mobile/desktop variants, localized versions), the system clusters these duplicates and selects a single canonical URL to represent the group. Signals are stored at this canonical level, while alternate versions remain available for context-specific serving, such as mobile or language variants.
Not every processed page is indexed; low‑quality content, pages blocked by robots rules, or pages that are hard to parse reliably may be skipped.
Those that are kept are stored in a distributed search index spanning hundreds of billions of documents, built from multiple internal structures such as inverted indexes, vector indexes, and structured databases. Each search vertical (such as web, images, videos, books, and local listings) typically maintains its own indexes over its specific corpus. In parallel, the system maintains a Knowledge Graph, a structured network of entities and facts that functions as an internal encyclopedia.
Crawling and indexing are continuous processes because the web is constantly changing. Systems learn which pages and feeds update frequently and revisit them more often, while others are recrawled only occasionally.
Serving delivers search results upon query
“Serving” is the umbrella term for everything that happens at query time, and it’s also where Agentic search systems differ most from classical search. The rest of this post focuses on that query-time stage.
Query Understanding
When a user submits a query, the system’s first responsibility is to understand what the user actually wants. Query understanding transforms raw, ambiguous, or incomplete user input into structured signals that can drive retrieval, ranking, and evidence selection. In classical search, this stage handled tasks like spell correction, synonym expansion, intent classification, entity recognition, and query rewriting. These remain essential foundations, enabling the engine to correct inputs (“itlian restaurat → Italian restaurant”), expand them (“Italian restaurant → trattoria, pizza, pasta”), classify intent (informational vs. local vs. transactional), and detect key entities (people, locations, dates, brands) needed for routing to the right verticals.
Agentic search extends this dramatically. Users now phrase queries as natural-language instructions, open-ended tasks, or multi-step goals (“plan my trip,” “explain this code,” “compare these universities”), often with images or voice. Modern systems call LLMs to interpret intent, extract constraints, infer latent goals, and translate the original query into an internal representation that supports tool use, retrieval planning, and stepwise reasoning. The result is an actionable task graph: structured, interpretable, and ready for retrieval across web, APIs, vertical indexes, and real-time sources. Because the interface is conversational, query understanding extends across turns—refining the intent with follow-ups (“make it cheaper,” “add vegetarian options,” “avoid long lines”).
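Conceptually, the output of query understanding is a small structured object rather than a rewritten keyword string. Below is a minimal sketch of what that might look like; `call_llm` is a hypothetical stand-in for whatever model endpoint a given system uses, and the JSON schema is illustrative rather than any vendor's actual format.

```python
import json
from dataclasses import dataclass, field

@dataclass
class QueryInterpretation:
    intent: str                         # e.g. "local_planning", "comparison", "factual"
    entities: list = field(default_factory=list)
    constraints: dict = field(default_factory=dict)
    sub_tasks: list = field(default_factory=list)

UNDERSTANDING_PROMPT = (
    "Interpret the search query below. Return JSON with keys: "
    "intent, entities, constraints, sub_tasks.\n"
    "Conversation so far: {history}\nQuery: {query}\nJSON:"
)

def understand_query(query: str, history: list[str], call_llm) -> QueryInterpretation:
    """Turn a raw, possibly conversational query into structured signals."""
    raw = call_llm(UNDERSTANDING_PROMPT.format(query=query, history=" | ".join(history)))
    parsed = json.loads(raw)            # production systems validate and repair this JSON
    return QueryInterpretation(
        intent=parsed.get("intent", "unknown"),
        entities=parsed.get("entities", []),
        constraints=parsed.get("constraints", {}),
        sub_tasks=parsed.get("sub_tasks", []),
    )
```

Because the interface is conversational, the `history` argument lets follow-up turns ("make it cheaper") refine the same interpretation rather than start over.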
Query Fan-Out
A defining capability of agentic search is query fan-out, where the LLM decomposes a user’s request into parallel sub-queries tailored to different information sources. For a query like “What can a group of friends do in Nashville this weekend?”, a traditional search engine returns a heterogeneous list of pages. The system interprets the underlying tasks: group size, timing, budget, food, nightlife, events, local availability. It then auto-generates structured sub-queries such as “Nashville group-friendly restaurants open tonight,” “live music venues Saturday,” “family-friendly daytime activities,” or “event tickets under $50.” These are routed to local, maps, events, activities, shopping, and real-time verticals.
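A minimal fan-out sketch might look like the following; the vertical names, routing table, and prompt are hypothetical, and `call_llm` is again a stand-in for the system's planning model.

```python
import json

# Hypothetical routing vocabulary: each sub-query is tagged with the index
# or API best suited to answer it.
VERTICALS = {"local", "maps", "events", "shopping", "web", "realtime"}

FANOUT_PROMPT = (
    "Decompose the task below into 3-8 self-contained sub-queries.\n"
    'Return a JSON array of objects: [{{"query": "...", "vertical": "local|maps|events|shopping|web|realtime"}}]\n'
    "Task: {task}\nJSON:"
)

def fan_out(task: str, call_llm) -> list[dict]:
    """Ask the planning model to decompose a task and route each sub-query."""
    sub_queries = json.loads(call_llm(FANOUT_PROMPT.format(task=task)))
    # Keep only well-formed sub-queries routed to a known vertical.
    return [sq for sq in sub_queries
            if sq.get("query") and sq.get("vertical") in VERTICALS]

# fan_out("What can a group of friends do in Nashville this weekend?", call_llm)
# might yield sub-queries for restaurants (local), Saturday live music (events),
# tickets under $50 (shopping), and weather (realtime).
```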
Handling Query Complexity
Not every query requires deep decomposition or heavyweight models. Agentic search systems classify query complexity and allocate resources accordingly. Simple factual queries (“temperature,” “who is X”) use small models and a single retrieval call for instantaneous responses. Moderately complex queries (“compare two universities’ CS programs”) use a more capable model with a handful of well-chosen searches and may require a few seconds of reasoning. For high-stakes or high-complexity tasks—financial decisions, safety research, product due diligence, regulatory comparisons— the system escalates to deep search mode, where dozens or hundreds of sub-queries are issued.
This adaptive behavior uses model selection (routing to small vs. large models), depth estimation, and budget scheduling, allocating more compute, more retrieval, or more reasoning only when the task demands it.
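One way to picture this routing is a small lookup from estimated complexity to a compute budget. The tiers, thresholds, and model names below are invented for illustration, and the sketch reuses the `QueryInterpretation` shape from the earlier query-understanding example.

```python
from dataclasses import dataclass

@dataclass
class SearchBudget:
    model: str               # which LLM tier handles reasoning and synthesis
    max_sub_queries: int
    max_reasoning_steps: int

# Hypothetical tiers; real systems learn these thresholds from live traffic.
BUDGETS = {
    "simple":   SearchBudget("small-fast-llm",  max_sub_queries=1,  max_reasoning_steps=1),
    "moderate": SearchBudget("mid-tier-llm",    max_sub_queries=5,  max_reasoning_steps=4),
    "deep":     SearchBudget("frontier-llm",    max_sub_queries=50, max_reasoning_steps=20),
}

def pick_budget(interpretation) -> SearchBudget:
    """Rough complexity estimate: escalate as sub-tasks and constraints pile up."""
    load = len(interpretation.sub_tasks) + len(interpretation.constraints)
    if interpretation.intent == "factual" and load <= 1:
        return BUDGETS["simple"]
    if load <= 4:
        return BUDGETS["moderate"]
    return BUDGETS["deep"]
```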
Retrieval
The retrieval stage gathers a wide pool of potentially useful information from many different sources. It is not a single lookup against a monolithic web index but an orchestrated process across multiple heterogeneous sources, each optimized for different content types, freshness requirements, and trust expectations.
For the open web, retrieval combines hybrid lexical and vector search. A traditional inverted index surfaces exact matches, entities, and long-tail facts, while a dense vector index (ANN engines such as HNSW, ScaNN, or Faiss) captures paraphrases, descriptions, and conceptual similarity.
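As a concrete illustration, here is a minimal hybrid-retrieval sketch: `rank_bm25` stands in for the inverted index, a small sentence-transformers model stands in for the dense ANN index, and reciprocal rank fusion combines the two rankings. The corpus, model choice, and fusion constant are illustrative, not what any production engine actually uses.

```python
# pip install rank_bm25 sentence-transformers numpy
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "Trattoria Luca serves Neapolitan pizza and fresh pasta downtown.",
    "Nashville live music venues open late on Saturday nights.",
    "How to repot a fiddle-leaf fig without damaging the roots.",
]

bm25 = BM25Okapi([d.lower().split() for d in docs])       # stand-in for the inverted index
encoder = SentenceTransformer("all-MiniLM-L6-v2")         # stand-in for the ANN index
doc_vecs = encoder.encode(docs, normalize_embeddings=True)

def hybrid_search(query: str, k: int = 3, rrf_k: int = 60):
    lexical_rank = np.argsort(-bm25.get_scores(query.lower().split()))
    q_vec = encoder.encode([query], normalize_embeddings=True)[0]
    dense_rank = np.argsort(-(doc_vecs @ q_vec))
    # Reciprocal Rank Fusion: reward documents that rank high in either list.
    fused: dict[int, float] = {}
    for ranking in (lexical_rank, dense_rank):
        for rank, doc_id in enumerate(ranking):
            fused[int(doc_id)] = fused.get(int(doc_id), 0.0) + 1.0 / (rrf_k + rank + 1)
    best = sorted(fused, key=fused.get, reverse=True)[:k]
    return [(docs[i], fused[i]) for i in best]

print(hybrid_search("italian restaurant with pizza"))
```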
Beyond the general web index, the engines consult specialized vertical indexes—news, products, maps, videos, academic literature, reviews, or structured enterprise knowledge. These vertical indexes rely on domain schemas, metadata filters, geospatial lookup, timestamps, and sometimes small embedding-based similarity models for refinement. They are typically cheaper and more precise than open-web retrieval.
Retrieval also draws from structured knowledge sources, such as the Knowledge Graph and other entity and attribute stores. Queries here are answered via graph or key–value lookups, supplying canonical facts and relationships that help ground and disambiguate what the user is asking.
On top of internal indexes, modern systems rely heavily on external APIs as retrieval sources. This includes both search APIs (for example, Bing Web Search results used by Perplexity, specialized news or paper search APIs, marketplace search APIs) and real-time data APIs (stock prices, weather, flight status, sports scores, alerts, transit data). The responses from these APIs—documents, snippets, or structured JSON—are treated as additional candidates and flow into the same ranking pipeline as internally indexed content.
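The key engineering step is normalizing these heterogeneous payloads into one candidate format that the ranking pipeline can score uniformly. The sketch below uses hypothetical weather and web-search payload shapes purely to show that normalization.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    source: str                 # URL or API name
    kind: str                   # "web", "api_json", "vertical", ...
    freshness: str | None = None

def from_weather_api(payload: dict) -> Candidate:
    """Hypothetical weather payload -> candidate evidence the ranker can score."""
    return Candidate(
        text=f"{payload['city']}: {payload['temp_c']}°C, {payload['conditions']}",
        source="weather-api", kind="api_json", freshness=payload.get("observed_at"))

def from_web_search(hit: dict) -> Candidate:
    """Hypothetical web-search API hit -> candidate evidence."""
    return Candidate(text=hit["snippet"], source=hit["url"], kind="web")

# Both shapes land in one pool and flow through the same ranking pipeline.
pool = [
    from_weather_api({"city": "Nashville", "temp_c": 18, "conditions": "clear",
                      "observed_at": "2025-06-01T18:00Z"}),
    from_web_search({"snippet": "Top live music venues in Nashville this weekend...",
                     "url": "https://example.com/nashville-music"}),
]
```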
Across all these retrieval channels, the systems apply safety and eligibility constraints— blocking unsafe domains, routing queries to appropriate verticals, and filtering policy-violating or low-quality sources. The result is a heterogeneous collection of web passages, vertical results, structured entities, API responses, and cached session evidence. Retrieval’s job is breadth with guardrails—so that downstream ranking and LLM reasoning operate on a wide but policy-safe candidate pool.
Re-Ranking
In agentic search engines, ranking no longer exists to produce a list of “top 10 results.” Instead, ranking (or re-ranking) is about curating the best possible evidence set that an LLM can trust, synthesize, and ground its answer in.
The process begins with an optional coarse filtering to remove obvious redundancies, weak matches, or noisy content. This pass applies cheap heuristics—fast semantic checks, deduplication, shallow trust filters, and minimal cross-source consistency tests. Retrieval has already enforced domain, safety, and language constraints, so coarse filtering simply harmonizes heterogeneous sources—web pages, structured entries, product attributes, or news articles—into a consistent candidate pool and cleans the pool before deeper models act.
Coarse filtering is then followed by neural re-ranking using more powerful ML or LLM models (a minimal sketch follows this list):
- Neural Cross-Encoder Rerankers (such as BAAI/bge-reranker, monoT5, RankT5, or RankLLaMA) perform deep query–passage comparison. They identify the passages with the strongest semantic alignment, highest specificity, and clearest topical relevance, producing a refined shortlist.
- LLM-based Rerankers then evaluate these candidates for answerability, factual grounding, internal consistency, and clarity. These models capture subtleties that cross-encoders cannot—spotting contradictions, elevating richer evidence, and prioritizing passages that satisfy the user’s true intent rather than merely matching surface terms.
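As an illustration of the two stages above, here is a minimal sketch using the sentence-transformers `CrossEncoder` wrapper around a BGE reranker, followed by a hypothetical LLM judging pass; `call_llm` is a stand-in for whatever model endpoint is available, and the prompt and cutoffs are illustrative.

```python
from sentence_transformers import CrossEncoder

# Stage 1: cross-encoder scoring of every (query, passage) pair in the pool.
reranker = CrossEncoder("BAAI/bge-reranker-base")

def cross_encoder_rerank(query: str, passages: list[str], keep: int = 10) -> list[str]:
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda x: -x[1])
    return [p for p, _ in ranked[:keep]]

# Stage 2 (optional): an LLM judges the shortlist for answerability and consistency.
JUDGE_PROMPT = (
    "Query: {query}\nPassage: {passage}\n"
    "Score 0-10 for how well this passage supports a correct, complete answer. Number only:"
)

def llm_rerank(query: str, shortlist: list[str], call_llm, keep: int = 5) -> list[str]:
    scored = [(p, float(call_llm(JUDGE_PROMPT.format(query=query, passage=p))))
              for p in shortlist]
    return [p for p, _ in sorted(scored, key=lambda x: -x[1])[:keep]]
```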
Below is a comparison of traditional ranking and ranking in agentic search engines.
| Aspect | Traditional Ranking (pre-2022, classic Google/Bing) | Modern AI-Heavy Ranking (2024–2025: SGE, Perplexity, Copilot, ChatGPT Search, etc.) |
| --- | --- | --- |
| Main goal | Produce the best ordered list of 10 blue links | Produce the best small set of documents (usually 5–30) that an LLM can reliably synthesize into a correct, fluent answer |
| How many documents are ranked | Top 10 for the SERP (sometimes top 1,000 internally) | Top 100–500 for initial retrieval → re-ranked to top 5–30 for the LLM |
| Primary scoring model | Hand-tuned formula + lightweight ML (BM25 + PageRank + 200–500 signals) | LLM-driven re-ranking (cross-encoder, listwise LLM scorer, or direct prompt-based ranking) |
| Who/what does the final sort? | Ranking algorithms (learning-to-rank models like LambdaMART, RankBrain, PageRank, BERT-based cross-encoders) | An LLM (Gemini, GPT-4o, Claude 3.5, Llama 3.1 70B, etc.) that reads the query + documents |
| Key signals | Keywords, backlinks, click rates, freshness, context | All the old ones + semantic coherence with the LLM’s generated answer, factual consistency, citation-worthiness, low hallucination risk when summarized |
| Output | Final ordered list shown directly to the user | Ordered (or scored) shortlist fed to the generation/inference stage |
| Typical name | “Ranking” or “Learning to Rank” | “Re-ranking”, “LLM re-ranker”, “second-stage ranker”, “RAG ranker” |
| Latency budget | Must be < 100 ms | First stage < 100 ms; re-ranking can take 300–1,500 ms because it runs only on the small shortlist |
Surrounding the core reranking modules are several supporting layers—safety filters, vertical adjustments, consistency checks, and coverage logic. These components regulate eligibility, suppress harmful or low-quality material, ensure agreement among sources, and guarantee that multi-part queries receive balanced coverage.
The result of this layered process is a compact, high-quality evidence set that the LLM can reliably ground its answer on.
Agentic Reasoning
Modern AI search experiences incorporate agentic reasoning: the ability to interpret, decompose, and research complex queries.
When users ask broad or multi-faceted questions, such as “plan a weekend in Nashville” or “compare quiet hybrid cars for city driving”, the system works through the following steps (a minimal sketch of this loop appears after the list):
- Interprets and decomposes intent
Breaks the question into sub-queries (“kid-friendly options,” “restaurants,” “hotels”), guided by LLM planning.
- Retrieves from multiple sources
Uses web search, vector retrieval, local results, product data, knowledge graphs, and high-quality publishers.
- Cross-checks and reconciles information
Ensures consistency, removes duplicates, merges multi-source facts, and fills gaps using follow-up retrieval.
- Synthesizes structured, grounded results
Produces coherent overviews, comparison tables, short guides, or itineraries—always with citation links or source attributions.
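Here is a minimal, self-contained sketch of that plan / retrieve / reconcile / synthesize loop. Both `call_llm(prompt) -> str` and `search(query) -> list[str]` are hypothetical stand-ins, and the prompts are illustrative rather than anything a production system actually uses.

```python
import json

def agentic_answer(task: str, call_llm, search, max_rounds: int = 3) -> dict:
    """Plan -> retrieve -> reconcile -> synthesize.
    `call_llm(prompt) -> str` and `search(query) -> list[str]` are hypothetical stand-ins."""
    # 1. Interpret and decompose intent into sub-queries.
    sub_queries = json.loads(call_llm(
        f"Decompose into sub-queries. Return a JSON array of strings.\nTask: {task}\nJSON:"))
    evidence: dict[str, list[str]] = {}
    for _ in range(max_rounds):
        # 2. Retrieve from multiple sources for each open sub-query.
        for sq in sub_queries:
            evidence.setdefault(sq, []).extend(search(sq))
        # 3. Cross-check: ask the model what is still missing or inconsistent.
        gaps = json.loads(call_llm(
            "List still-missing sub-queries as a JSON array (empty if none).\n"
            f"Task: {task}\nCovered so far: {list(evidence)}\nJSON:"))
        if not gaps:
            break
        sub_queries = gaps               # follow-up retrieval fills the gaps
    # 4. Synthesize a structured, grounded result with source attributions.
    answer = call_llm(
        f"Task: {task}\nEvidence: {evidence}\nWrite a concise answer citing sources inline.")
    return {"answer": answer, "evidence": evidence}
```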
Generation
The generation layer leverages LLMs to synthesize a final answer.
Search-augmented LLM
Large language models used in modern search engines (e.g., Perplexity Sonar, Google Gemini for Search, Microsoft GraphRAG) are fine-tuned on retrieval-augmented data to prioritize retrieved evidence over internal parametric knowledge, reduce hallucinations, and generate factually grounded answers with citations. While the underlying model inherits strong reasoning, summarization, and planning abilities, search systems fine-tune it further to operate reliably inside a grounding-heavy environment. The fine-tuning teaches the model to structure responses around user intent and retrieved context, incorporate search-specific signals such as freshness from live web results, use knowledge graphs and entity metadata for accurate understanding, and optimize for quality and explainability (for example, by citing and linking sources). This specialization enables responses that are accurate, up to date, and anchored in verifiable sources, which is the core requirement for reliable agentic search.
Fusion-Based Generation
Modern systems use Fusion-in-Decoder and other multi-vector attention mechanisms, enabling the decoder to attend over many passages at once. Instead of relying on a single “best” snippet, the model integrates multiple evidence spans, resolving conflicts, enriching context, and preserving detail. This produces answers that reflect the full breadth of retrieved information rather than a single source’s framing.
Citation-Aware Generation
During generation, the model attaches inline citations or source IDs to grounded claims. These citations come directly from the reranked evidence pack—ensuring traceability and minimizing hallucination. The system may also insert links, highlight supporting passages, or expose a side panel of sources, making the reasoning behind the answer transparent.
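A common, simple way to make citations traceable is to number the evidence passages in the prompt, ask the model to emit `[n]` markers, and then map those markers back to source URLs. The sketch below assumes a hypothetical evidence format (`{"url": ..., "text": ...}`); real systems carry richer source metadata.

```python
import re

def build_grounded_prompt(question: str, evidence: list[dict]) -> str:
    """Number each evidence passage so the model can cite it as [1], [2], ..."""
    sources = "\n".join(f"[{i + 1}] ({e['url']}) {e['text']}" for i, e in enumerate(evidence))
    return (f"Answer using ONLY the sources below. Cite every claim as [n].\n"
            f"{sources}\n\nQuestion: {question}\nAnswer:")

def extract_citations(answer: str, evidence: list[dict]) -> dict[int, str]:
    """Map the [n] markers the model emitted back to their source URLs."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return {n: evidence[n - 1]["url"] for n in sorted(cited) if 1 <= n <= len(evidence)}
```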
Faithfulness Guards
Because generation is the stage where hallucinations can emerge, agentic search employs faithfulness guards: attribution-constrained decoding, contrastive decoding, disallowed-content detectors, and domain-specific safety policies for sensitive topics. If the request is sensitive or task-critical, the system may switch to a more conservative model variant tuned for rigor over style.
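As a toy illustration of a faithfulness guard, the sketch below flags answer sentences whose content words barely overlap any retrieved passage. Production systems rely on NLI or attribution models (AlignScore-style scorers) rather than token overlap, so treat this purely as the shape of the check, not the actual mechanism.

```python
def unsupported_sentences(answer: str, evidence: list[str], min_overlap: float = 0.35) -> list[str]:
    """Flag answer sentences whose content words barely overlap any retrieved passage.
    A real guard would use NLI / attribution scoring; this only shows the shape of the check."""
    flagged = []
    for sentence in (s.strip() for s in answer.split(".")):
        words = {w.lower().strip(",;:") for w in sentence.split() if len(w) > 3}
        if not words:
            continue
        best = max((len(words & {w.lower() for w in passage.split()}) / len(words)
                    for passage in evidence), default=0.0)
        if best < min_overlap:
            flagged.append(sentence)
    return flagged   # non-empty -> regenerate, soften the claim, or drop the sentence
```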
In agentic search, generation is therefore the interpretation layer—the point where curated evidence, structured reasoning, and safety controls come together to produce a reliable, grounded, and actionable answer.
Multimodal Capabilities
Multimodal capabilities extend agentic search engines to the real world, supporting a wide range of input modalities (images, voice, video, screenshots, and continuous dialogue) and allowing people to interact with search in more natural and contextual ways.
Frontier models such as Gemini, GPT, and Claude are natively multimodal: they can interpret text, images, audio, and video within the same reasoning context. This enables the search engine to understand complex situations (“What’s wrong with this appliance?”), constrained tasks (“Find me cheaper versions of this sweater”), or contextual queries (“Plan a dinner based on what’s in my fridge”).
These capabilities also drive the growth of visual-first search experiences across phones, browsers, and AR interfaces.
Visual Search
Visual search usage has grown rapidly—especially among younger users—through features like Google Lens, Circle to Search, and image-driven queries in Perplexity and Bing Copilot. Users can ask questions by pointing a camera, uploading a screenshot, or cropping part of an image. Common use cases include:
- Shopping: Identify clothing, furniture, or decor; find similar or “more colorful/simpler” alternatives; extract product details from screenshots.
- Homework and learning: Take a picture of a math or science problem to receive step-by-step explanations rather than just answers.
- World identification: Recognize plants, animals, buildings, artworks, books, landmarks, or signage.
Voice search and Real-Time Dialogue
Voice-first interaction is becoming central to mobile and hands-free search. Modern engines support real-time conversational search, where the model listens continuously, interprets follow-up questions, and performs agentic fan-out behind the scenes. Users can ask for:
- live sports scores or stock updates,
- route changes while driving,
- restaurant suggestions while walking,
- clarification or refinement of previous results.
These sessions persist across devices: a search started in the car can continue later on a phone or laptop, with the agent retaining context, citations, and intermediate plans.
Video search
The next frontier is live video search and real-time visual dialogue. Users will be able to stream a cooking setup, a mechanical issue, or a DIY problem, and the agent will analyze video frames continuously, track objects, understand actions, and overlay grounded guidance. For example:
- troubleshooting a car engine while pointing a camera at components,
- receiving cooking steps while showing the stovetop,
- scanning a room and asking for design or shopping suggestions,
- identifying hazards, steps, or missing tools in a live workflow.
The agent integrates what it sees with external search, structured knowledge, and vertical APIs, producing immediate, contextual guidance with links for deeper exploration.
Personalization
Personalization in modern search is shifting from generic suggestions to contextually tailored experiences that quietly anticipate user needs.
By incorporating behavioral patterns, location cues, and temporal trends, search systems can surface results that feel naturally aligned with what the user is likely seeking, improving engagement and task success by 20–30% while prioritizing privacy. Personalization does not affect objective queries: math, factual lookups, and encyclopedic questions return the same results for everyone. But preference-shaped queries, such as “Where should we go for a date dinner?” or “What fall shoes match my style?”, benefit from individualized context, producing results aligned with a user’s tastes, habits, and constraints.
Personalization can be lightweight, focusing on subtle, non-intrusive enhancements that build on user history, like "recently viewed" markers for quick re-access and adaptive dropdown suggestions based on past interactions.
Once users opt in to deeper integration, search can incorporate signals from services like Gmail, Calendar, and Maps to infer upcoming trips, reservations, commutes, team meetings, or frequently visited locations. This context allows the system to tailor dining, shopping, and travel suggestions by grounding recommendations in the user’s real-world patterns. This turns search from a public answer engine into a personal information assistant, one that understands your context, anticipates your needs, and tailors results without compromising universal truth or user control.
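A lightweight way to picture this is a gentle re-scoring layer that only touches preference-shaped results. The sketch below is purely illustrative: the `UserContext` fields and boost weights are hypothetical, and objective queries would bypass this step entirely.

```python
from dataclasses import dataclass, field

@dataclass
class UserContext:
    """Opt-in signals only; objective queries bypass personalization entirely."""
    preferred_cuisines: set = field(default_factory=set)
    home_city: str = ""
    upcoming_trips: list = field(default_factory=list)   # e.g. inferred from calendar or email

def personalize(candidates: list[dict], ctx: UserContext, weight: float = 0.2) -> list[dict]:
    """Gently re-score preference-shaped results; never rewrites factual answers."""
    def boosted_score(c: dict) -> float:
        boost = 0.0
        if c.get("cuisine") in ctx.preferred_cuisines:
            boost += 1.0
        if c.get("city") == ctx.home_city or c.get("city") in ctx.upcoming_trips:
            boost += 0.5
        return c["score"] + weight * boost
    return sorted(candidates, key=boosted_score, reverse=True)
```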
Evaluating Search Quality
Evaluating modern, agentic search systems is fundamentally more complex than evaluating classical web search: evaluation must cover retrieval quality, semantic alignment, and grounded generation. A robust evaluation framework therefore combines offline and online methods.
Offline Evaluation: Benchmarks, Ground Truth & Simulation
Offline evaluation uses benchmark datasets and human annotations to verify core model capabilities in a controlled, repeatable fashion. Performance is measured across all components of the stack.
- Retrieval & Ranking Metrics
- Metrics like Recall@K, MRR (reciprocal rank of the first relevant result), MAP (mean precision across recall levels), and NDCG@K (which rewards highly relevant results near the top with a logarithmic discount, normalized to [0, 1]) remain foundational; a minimal sketch of these metrics appears after the offline-evaluation list below.
- Using LLMs to judge retrieval quality or passage relevance is now standard, for example G-Eval-style LLM-as-judge scoring, Meta's Llama evaluation guidance, and OpenAI's model-graded evaluation approach.
- Human-in-the-loop evaluation: Human raters manually review results for sampled queries to assess relevance, reliability, page quality, usefulness, diversity, etc.
Note that public benchmarks like BEIR / MTEB (ranking) and HotpotQA / 2WikiMultihopQA (multi-hop retrieval) are widely used in open-source and academic evaluation of agentic search stacks, but they do not guarantee real-world production quality, given the scale, latency, freshness, vertical-coverage, and legal/compliance requirements of production systems.
- Grounded Generation Quality
- Grounding Score / Attribution Score: measures whether generated claims are tied to retrieved passages.
- Factual Consistency: evaluated via LLM-as-judge or span-level alignment (e.g., FactScore, AlignScore).
- Faithfulness / Hallucination Detection: often combines human annotation, “LLM-as-judge” scoring (G-Eval, CoVe, ICE), and reference-based evaluation.
- Common benchmarks include RAGAS, ARES, G-Eval, TruthfulQA, QAG, and FACTS-Graph.
- Multimodal Grounding Metrics
Used when images, screenshots, or videos enter search workflows. These metrics measure correct visual interpretation, alignment between visual features and textual descriptions, precision of visual grounding (e.g., bounding boxes and region–text alignment), and effectiveness of retrieval when queries combine text and visuals, ensuring the model truly understands and grounds answers in the visual content rather than hallucinating or retrieving irrelevant material. Typical public metrics:
- Image–Text Retrieval Recall@K (CLIP, SigLIP benchmarks)
- VQA accuracy (Visual Question Answering datasets)
- MM-Safety benchmarks
- Multimodal hallucination metrics
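For reference, here is a minimal sketch of the core retrieval and ranking metrics listed earlier (Recall@K, MRR, NDCG@K); the tiny example at the bottom is made-up data purely to show the calling convention.

```python
import math

def recall_at_k(relevant: set, ranked: list, k: int) -> float:
    """Fraction of the judged-relevant documents that appear in the top k."""
    return len(relevant & set(ranked[:k])) / max(len(relevant), 1)

def mrr(relevant: set, ranked: list) -> float:
    """Reciprocal rank of the first relevant document."""
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(gains: dict, ranked: list, k: int) -> float:
    """gains maps doc_id -> graded relevance; logarithmic discount, normalized to [0, 1]."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

# Toy example: one query's ranked list against its relevance judgments.
ranked = ["d3", "d1", "d7", "d2"]
print(recall_at_k({"d1", "d2"}, ranked, k=3),
      mrr({"d1", "d2"}, ranked),
      ndcg_at_k({"d1": 3, "d2": 1}, ranked, k=4))
```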
Offline evaluation is safe, repeatable, and controlled, but cannot capture real-time behavior or long-tail queries.
Online Evaluation: Real Users, Data, Feedback
Online evaluation refers to the live, production-time assessment of search quality using real user interactions. It measures how well the system performs under real-world constraints such as latency, freshness, and user behavior, and evaluates changes through rigorous live experiments, A/B testing, and user-interaction analysis before launch. These online studies complement quality-rater evaluations and ensure that updates improve production-time performance for real users.
- Live Experiments and A/B Testing:
- A/B experiments on real traffic, testing new algorithms against control versions.
- Live traffic evaluations: number of active users, query volume and engagement, adoption tracking.
- User studies: alpha/beta feedback to shape features; stress testing for UX evaluation.
- Large-scale log analysis to evaluate patterns across many queries, refine AI responses, and address issues like hallucinations in live traffic.
- User Interaction Signals
Search engines use aggregated, anonymized interaction data to understand whether results are helpful, including clicks, time spent on pages, query reformulations, and broader browsing patterns. These signals are not used as pure optimization targets but help systems understand when results appear to be relevant or satisfying for users.
- Performance and Latency Monitoring
Delivering high-quality search results requires attention to latency, as people expect quality results in a fraction of a second. Therefore, online evaluation continuously measures:
- Response time of search results
- Impact of new algorithms on system speed
- Overall user experience under real load
- Freshness and Real-Time Behavior
Search engines also need to handle constantly changing information. Online evaluation therefore monitors how well systems:
- React to new or updated content
- Surface fresh material when appropriate
- Maintain quality as the web evolves
This real-time monitoring helps detect drift, such as changes in popular queries or shifts in web content.
- Human and Automated Feedback Loops
- Quality rater evaluations: human reviewers judge relevance and usefulness using the quality rater guidelines.
- Automated monitoring systems: detect anomalous behavior, ranking regressions, or unexpected shifts.
Online evaluation works hand in hand with these human and automated feedback loops. The combined signals ensure that online system changes improve overall search quality, not just isolated metrics.
Challenges in Modern Search
Despite advances, challenges abound.
1. Reliability
As search becomes generative, the risk of hallucinations (incorrect assertions, invented facts, misleading summaries) rises. Ensuring factuality remains a major challenge.
2. Freshness
Search must continuously integrate new content and surface timely, up-to-date information across webpages, news, media, and real-time data (weather, stocks, events). For agentic search, which may answer sophisticated, time-sensitive queries, maintaining real-time content freshness while avoiding stale data becomes harder. Content evolves quickly; verifying freshness while preserving speed is nontrivial. “Drift” in content, user behavior, or world state may degrade performance over time, requiring continuous monitoring and adaptation.
3. Latency, Performance & Scalability
Agentic search involves heavier compute (retrieval + reranking + generation + possibly tool calls), making low-latency returns harder. Scaling such systems globally while keeping responsiveness high (across mobile, desktop, varying network conditions, and for multimodal content) remains a major infrastructure challenge.
4. Broad Coverage
As search synthesizes across verticals (news, shopping, flights, local) and multimodal content (text, images, video, structured data), it is difficult to integrate and normalize such heterogeneous sources while preserving reliability.
5. Privacy, Personalization & Moderation
As search becomes more personalized, privacy and safety become more important. Systems must avoid exposing sensitive data (user history, preferences, context, possibly personal data) or creating filter-bubble outputs.
Additionally, content-moderation — avoiding harmful content, misinformation, bias — becomes harder at scale, especially when combining sources or generating novel text.
6. Transparency, Explainability, and Trust
For users to trust the answers, they need clear attribution (where information comes from), the ability to see original sources, and an understanding of why an answer was generated. As search becomes more generative, this becomes harder.
For publishers and content creators, the shift to generative answers raises concerns around traffic diversion, visibility, and fairness. When systems apply “warning banners” to filter out low-quality content or misinformation, a lack of clarity over the criteria erodes user trust.
7. Ethical Impacts
As search becomes more personalized and AI-driven, there’s a risk of reinforcing existing biases, limiting or distorting viewpoint diversity. Guaranteeing fairness, mitigating bias, ensuring neutrality in generative outputs across languages/regions/content types remains a systemic challenge.
Conclusion
Agentic search engines no longer treat queries as simple string-to-document matches. They interpret user intent, decompose complex questions, perform multi-step reasoning over retrieved information, and integrate semantic search with LLM-based synthesis to produce grounded, structured answers that go far beyond traditional search.
References
- Zhang, W., Li, Y., Bei, Y., Luo, J., Wan, G., Yang, L., Xie, C., Yang, Y., Huang, W.-C., Miao, C., et al. (2025). From Web Search Toward Agentic Deep Research: Incentivizing Search with Reasoning Agents.
- Chen, J., Jiang, X., Wang, Z., Zhu, Q., Zhao, J., Hu, F., Pan, K., Xie, A., Pei, M., Qin, Z., Zhang, H., Zhai, Z., Guo, X., Zhou, R., Wang, K., Geng, M., Chen, C., Lv, J., Huang, Y., Liang, X., & Li, H. (2025). UniSearch: Rethinking Search System with a Unified Agentic Architecture.
- Andhavarapu, A. (2025). Demystifying Distributed Search Systems: Architecture and Principles. International Journal of Computer Engineering and Technology
- Sager, P. J., Kamaraj, A., Grewe, B. F., & Stadelmann, T. (2025). Deep Retrieval at CheckThat! 2025: Identifying Scientific Papers from Implicit Social Media Mentions via Hybrid Retrieval and Re-Ranking
- Google Cloud. (2025). Vertex AI RAG Engine overview. Google Cloud Documentation
- Google Cloud. (2024). Grounding with Google Search — Vertex AI Generative AI Studio Documentation.
- Aisha Malik. (2025). Google Brings Gemini in Chrome to U.S. Users, Unveils Agentic Browsing Capabilities, and More. TechCrunch.
- Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P. E., ... & Jégou, H. (2025). The faiss library. IEEE Transactions on Big Data
- Singh, A., Ehtesham, A., Kumar, S., & Khoei, T. T. (2025). Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG.
- Guanting Dong, Jiajie Jin, Xiaoxi Li, Yutao Zhu, Zhicheng Dou, and Ji-Rong Wen. 2025. RAG-Critic: Leveraging Automated Critic-Guided Agentic Workflow for Retrieval Augmented Generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics
- Gao, Y., Xiong, C., Gao, Z., Jia, J., Yin, J., Zhao, J., Sun, L., & Yu, H. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey.
- Baban, H., Pidaparthi, S. A., Gulati, S., & Nema, A. (2025). Optimizing Retrieval-Augmented Generation with Multi-Agent Hybrid Retrieval.
- Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems
- Zha, Y., Yang, Y., Li, R., & Hu, Z. (2023). AlignScore: Evaluating factual consistency with a unified alignment function. arXiv preprint arXiv:2305.16739.
- Es, S., James, J., Anke, L. E., & Schockaert, S. (2024, March). Ragas: Automated evaluation of retrieval augmented generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations
- Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., & Liu, Z. (2024). Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation
- Min, S., Krishna, K., Lyu, X., Lewis, M., Yih, W. T., Koh, P., ... & Hajishirzi, H. (2023, December). Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
- Author: Fan Luo
- URL: https://fanluo.me/article/demystifying-agentic-search-engines
- Copyright: All articles in this blog adopt the BY-NC-SA agreement. Please indicate the source!
