Design, Develop, and Deploy Production-Ready RAG Applications
Ofer Mendelevitch and Forrest Sheng Bao

#RAG
#GraphRAG
#LLM
#AI_Agent
🔧 RAG (Retrieval-Augmented Generation) has become the standard way to connect large language models to an organization's proprietary knowledge. The problem is that the market is now full of different RAG pipelines and components, and choosing the one that actually works for enterprise needs is far from simple. This book tackles exactly that problem and offers a complete roadmap for building, optimizing, and scaling RAG at production level.
📘 The authors, Ofer Mendelevitch and Forrest Sheng Bao, explain the whole path from the basics to the advanced level: from data ingestion, embeddings, and vector search to advanced techniques such as agentic RAG, multimodal RAG, and GraphRAG.
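📊 When you build RAG at enterprise level, merely “working correctly” is not enough. You also have to deliver high accuracy with minimal hallucinations, low latency, data privacy, and transparent, explainable responses.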
📑 What this book helps you do:
🧠 Decide whether to implement RAG yourself or use RAG-as-a-Service.
⚙️ Build a baseline RAG stack that performs well while keeping costs under control.
📏 Measure real metrics such as hallucination rate, answer quality, latency, and cost.
🔐 Handle enterprise challenges such as security, privacy, and compliance, and even prompt design.
🧩 Move on to advanced techniques such as multimodal RAG, agentic RAG, and GraphRAG.
📑 Table of Contents
Chapter 1. Introduction to Retrieval-Augmented Generation (RAG)
Chapter 2. The Base RAG Stack
Chapter 3. Scaling Your RAG Stack
Chapter 4. Deploying RAG to Production
Chapter 5. The RAG Platform
Chapter 6. Evaluating Your RAG Application
Chapter 7. From RAG to AI Agents
Chapter 8. Multimodal RAG
Chapter 9. Knowledge-Enhanced RAG
Chapter 10. The Future of RAG
🧠 Who is this book for?
👨‍💻 For people who work on production AI systems: software engineers, ML engineers, and data architects whose systems are going straight onto the organization's critical path.
⚠️ This is not a beginner's tutorial. It assumes you are comfortable with Python and basic programming concepts and are moving into real system design.
🚫 Who is this book not for?
🐍 If you are still not comfortable with the difference between a list and a dictionary, this book will be heavy going.
🧪 If you are after the deep mathematical theory of neural networks or Transformers, that is not the focus here.
🧰 If you want to build without writing code, this book is not for that style of work at all.
🧱 What is this book about?
📉 Many teams that bring RAG into production assume the problem lies in the technology itself, but in reality the gap between a demo and production runs much deeper than that.
This book acts as the bridge between those two worlds.
🚀 What will you be able to do after reading this book?
🔎 Upgrade simple vector search to hybrid search, reranking, and even knowledge graphs.
❌ Reduce hallucinations with retrieval controls and better grounding.
🖼️ Make the system multimodal (tables, images, diagrams, video).
📊 Move evaluation from vibe-based tests to real, measurable metrics.
⚡ Optimize the system for real production latency and cost.
✍️ About the authors
👨‍💻 Ofer Mendelevitch leads Developer Relations at Vectara. He has worked for years in machine learning, data science, and big data systems, and since 2019 his focus has been on LLM-based products. Before that, he worked on data and ML systems at Yahoo!, Hortonworks, and other companies.
🎓 He holds a bachelor's degree in computer science from Technion and a master's degree in electrical engineering from Tel Aviv University, and is also the author of Practical Data Science with Hadoop.
👨‍🔬 Forrest Sheng Bao co-leads the Machine Learning team at Vectara. He has more than 10 years of experience in AI and NLP and was previously an assistant professor at Iowa State University.
🎓 He holds a PhD in computer science with a minor in electrical engineering from Texas Tech University.
Retrieval-augmented generation (RAG) is the go-to strategy for integrating large language models with your organization's unique knowledge. However, the market is full of RAG pipelines and components, making it hard to choose the right solution for your enterprise's needs. This book simplifies the process, offering a comprehensive road map to building, refining, and scaling production-grade RAG applications.
Authors Ofer Mendelevitch and Forrest Bao guide you through every phase of development, from data ingestion, embeddings, and vector search to advanced techniques like agentic RAG, multimodal RAG, and GraphRAG. Engineers and architects will learn how to tackle the challenges they'll encounter when building RAG applications at enterprise scale, including ensuring high accuracy with minimal hallucinations, maintaining low-latency performance, safeguarding data privacy, and providing transparent, explainable responses.
Table of Contents
Chapter 1. Introduction to Retrieval-Augmented Generation (RAG)
Chapter 2. The Base RAG Stack
Chapter 3. Scaling Your RAG Stack
Chapter 4. Deploying RAG to Production
Chapter 5. The RAG Platform
Chapter 6. Evaluating Your RAG Application
Chapter 7. From RAG to AI Agents
Chapter 8. Multimodal RAG
Chapter 9. Knowledge-Enhanced RAG
Chapter 10. The Future of RAG
Who This Book Is For
This book is for the builders in the trenches of the AI era—the software engineers, machine learning engineers, and data architects who know that the distance between a successful pip install and a reliable production system is measured in sleepless nights.
You are likely responsible for putting RAG systems on the critical path: the systems that customers, employees, and leadership now depend on. You aren’t looking for another tutorial on prompt engineering; you are tasked with the structural heavy lifting. Whether you are designing document pipelines that don’t choke on complex PDF files, implementing guardrails to kill hallucinations, or building the evaluation frameworks that prove your system actually works, this book is your guide.
While this is primarily an engineering text, it serves as a reality check for technical product managers and architects. If you define requirements, you need to understand the mechanical limits of RAG systems, and the role each component plays in the RAG stack. This book provides the technical intuition to distinguish between a realistic latency budget and a fantasy, ensuring you don’t promise features that physics and compute costs can’t deliver.
Who This Book Is Not For
To ensure this book is the right fit for your current journey, it is important to note that we skip the introductory basics. This is an advanced engineering guide, not a foundational Python course. We assume a level of comfort with Python’s core structures and basic programming patterns; if you are still distinguishing between lists and dictionaries, you will likely find the technical depth of our implementations more frustrating than helpful.
Furthermore, our lens is strictly focused on applied AI rather than academic theory. While we dive deep into the orchestration and optimization of RAG systems, we don’t spend time on the underlying calculus of neural networks, or the mathematical proofs behind transformer architectures.
Finally, this is a “hands-on” book in the literal sense: the code snippets throughout the book and the associated GitHub repository (which includes full code samples) are essential to fully understanding the material. It is not intended for “no-code” enthusiasts or casual consumers. If your goal is to assemble RAG applications without engaging directly with code, system design, and debugging, this book will likely feel misaligned with your expectations.
What This Book Is About
Many developers hit the production wall and assume the technology is flawed. It isn’t. The problem is that the techniques used to build a demo are fundamentally different from those required to build an enterprise-scale product.
This book is the bridge across that chasm. We tackle the unique operational challenges of RAG in production. By the end of this journey, you will be equipped to do the following:
Implement high-precision retrieval: Move beyond simple vector search to leverage hybrid search, relevance reranking, or knowledge graphs, ensuring accuracy for complex questions at enterprise scale.
Reduce hallucinations: Diagnose and reduce large language model (LLM) “hallucinations” using retrieval-aware guardrails, while ensuring your RAG system has the most up-to-date enterprise data for grounding its responses.
Integrate multimodal content: Expand your system’s capabilities to accurately interpret tables, images, diagrams, and videos, and integrate their information content into the RAG responses.
Establish rigorous evaluation: Move away from “vibe-based” testing—the habit of asking the chatbot three questions and assuming it works because the answers “look” right—toward repeatable, automated metrics that provide a statistical guarantee of reliability.
Optimize for the real world: Make informed build-versus-buy decisions and deploy systems that survive real-user latency constraints and deep observability requirements.
Our focus is RAG-specific resiliency: turning a brittle demo into a hardened enterprise asset. While we respect the foundations of general systems engineering, this book isn’t a generic primer on continuous integration and continuous delivery (CI/CD) or cloud infrastructure. Instead, we provide the blueprints to solve for the unique failure modes of RAG—from low-latency, high-accuracy retrieval optimization to deep observability—focusing on the design and implementation of a system that is visible, measurable, and reliable under the weight of production traffic and the messiness of enterprise data.
Ofer Mendelevitch leads developer relations at Vectara. He has extensive hands-on experience in machine learning, data science, and big data systems across multiple industries, and has focused on developing products using large language models since 2019. Prior to Vectara, he built and led data science teams at Syntegra, Helix, LendUp, Hortonworks, and Yahoo!. Ofer holds a B.Sc. in computer science from Technion and an M.Sc. in electrical engineering from Tel Aviv University, and is the author of "Practical Data Science with Hadoop" (Addison-Wesley).
Forrest Sheng Bao co-leads the Machine Learning team at Vectara. He has over 10 years of research experience in the areas of artificial intelligence (AI) and natural language processing (NLP). Prior to Vectara, he was an assistant professor at Iowa State University. Forrest holds a PhD in computer science with a minor in electrical engineering from Texas Tech University.









