Before you start: the two phases of every RAG system
Every RAG pipeline has exactly two phases, and it helps to keep them mentally separate. The first is indexing — a one-time preparation step where you process your documents and store them in a searchable form. The second is querying — what happens in real time each time a user asks a question. Understanding this split will make the rest of the guide click into place.
Think of indexing like building a library: you catalogue every book, assign it a shelf, and create an index so you can find things fast. Querying is like walking into that library with a specific question — the catalogue does the heavy lifting and hands you the right pages instantly.
Phase 1 — Indexing your documents
Step 1: Load your documents
Your knowledge base can be almost any text source: PDFs, Word documents, web pages, Notion pages, database records, or plain text files. Most RAG frameworks provide ready-made loaders for all common formats. At this stage you are simply reading the raw content into your pipeline — no transformation yet.
One practical tip: be intentional about what you include. A RAG system is only as trustworthy as its source documents. Outdated policies, contradictory versions of the same document, or low-quality content will all surface in the AI's answers. Clean, authoritative sources produce clean answers.
Step 2: Chunk your text
Chunking means splitting each document into smaller, focused passages before indexing them. If you store an entire 50-page manual as a single unit, its embedding will represent a blurry average of everything in it — too vague to retrieve precisely. Breaking it into focused 300–500 word passages means each chunk has a clear meaning that can be matched to a specific question.
A good starting point is fixed-size chunking with overlap: split the text into 400-word chunks, with each new chunk repeating the last 50 words of the previous one. The overlap ensures that a sentence sitting at the boundary between two chunks is not lost from either. This simple approach works well for most document types.
| Chunking Strategy | Best For | Watch Out For |
|---|---|---|
| Fixed-size with overlap | General documents, a reliable starting point for beginners | May split mid-sentence on dense technical content |
| Sentence-based | Conversational content, FAQs, support transcripts | Very short sentences can produce chunks without enough context |
| Section/heading-based | Structured docs like manuals, legal contracts, reports | Requires consistent formatting in source documents |
| Semantic chunking | Mixed-topic documents where topics shift mid-page | More complex to set up; needs an embedding model at chunk time |
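The fixed-size-with-overlap strategy is simple enough to sketch in a few lines. This is an illustrative implementation using the word counts suggested above, not any particular framework's splitter:

```python
def chunk_text(text, chunk_size=400, overlap=50):
    """Split text into fixed-size word chunks, where each chunk
    repeats the last `overlap` words of the previous one."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# A 1,000-word document yields three overlapping chunks:
doc = " ".join(f"word{i}" for i in range(1000))
chunks = chunk_text(doc)
```

In practice you would use your framework's text splitter, which adds niceties like splitting on sentence boundaries, but the core logic is exactly this sliding window.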
Step 3: Generate embeddings
An embedding model reads each chunk of text and converts it into a vector — a long list of numbers that represents the meaning of that passage. Two passages with similar meanings will have numerically similar vectors, even if they use completely different words. This is what makes semantic search possible: the system can find a passage about "cancellation terms" even if the user asked about "how to exit a contract."
You do not need to understand the mathematics of embeddings to use them. What you do need to remember is this: always use the same embedding model to index your documents and to embed user queries. Mixing models produces vectors in incompatible spaces, and all similarity scores become meaningless.
| Embedding Model | Provider | Best For |
|---|---|---|
| text-embedding-3-small | OpenAI API | Best balance of cost and quality for most projects |
| text-embedding-3-large | OpenAI API | Higher accuracy when retrieval quality is critical |
| bge-small-en-v1.5 | Hugging Face (open-source) | Runs locally, no API costs, good for private data |
| embed-multilingual-v3.0 | Cohere API | Strong multilingual support if you need multiple languages |
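The "numerically similar vectors" idea can be made concrete with cosine similarity, the standard way to compare embeddings. The vectors below are tiny made-up stand-ins (real models produce hundreds or thousands of dimensions), but the comparison logic is the same:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means
    very similar direction (similar meaning), near 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings for illustration:
cancellation_terms = [0.9, 0.1, 0.2]
exit_a_contract = [0.8, 0.2, 0.25]   # similar meaning, close vector
pizza_recipe = [0.1, 0.9, 0.1]       # unrelated meaning, distant vector

assert cosine_similarity(cancellation_terms, exit_a_contract) > \
       cosine_similarity(cancellation_terms, pizza_recipe)
```

This is why "cancellation terms" can match "how to exit a contract" despite sharing no words: the embedding model maps both to nearby points in the same space.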
Step 4: Store in a vector database
A vector database stores your embeddings and is optimised to search them extremely fast — even across millions of entries. When a user asks a question, the database finds the stored chunks whose vectors are closest in meaning to the question vector. This is called approximate nearest-neighbor search, and it is what makes RAG retrieval feel instant.
Alongside each vector, you also store the original chunk text and any metadata — the source document name, page number, date, or department. This metadata lets you filter searches later (for example, "only search documents from the legal department") and lets you show users where an answer came from.
| Vector Database | Type | Best For |
|---|---|---|
| ChromaDB | Open-source, runs locally | Prototyping and small projects — zero infrastructure setup |
| Pinecone | Fully managed cloud service | Production systems where you want no infrastructure to manage |
| Weaviate | Open-source, self-hosted or cloud | Production with hybrid search (keyword + semantic combined) |
| pgvector (Postgres) | Extension for existing Postgres DB | Teams already on Postgres who want to avoid a separate service |
| Qdrant | Open-source, self-hosted or cloud | High-performance use cases with filtering and payload support |
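Whichever database you choose, each stored record follows the same shape: a vector, the original text, and a metadata dictionary. The entry below is a hypothetical example (the field names vary slightly between databases), along with the kind of metadata filter check the database performs for you:

```python
# What a single stored entry conceptually looks like in a vector database:
entry = {
    "id": "policy-handbook-p12-c3",
    "vector": [0.12, -0.07, 0.33],  # the chunk's embedding (truncated here)
    "text": "Employees may cancel coverage within 30 days of enrollment...",
    "metadata": {
        "source": "policy-handbook.pdf",
        "page": 12,
        "department": "legal",
        "indexed_at": "2024-06-01",
    },
}

def matches(entry, where):
    """Return True if the entry's metadata satisfies every filter condition,
    e.g. where={"department": "legal"} for legal-only searches."""
    return all(entry["metadata"].get(k) == v for k, v in where.items())

assert matches(entry, {"department": "legal"})
assert not matches(entry, {"department": "sales"})
```

Real databases apply these filters before or during the similarity search, but the effect is the same: only entries whose metadata matches are candidates for retrieval.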
Phase 2 — Answering a query
Step 5: Embed the user's question
When a user submits a question, the first thing the system does is run it through the same embedding model used during indexing. This converts the question into a vector using the same mathematical space as your stored chunks — making comparison possible.
Step 6: Search the vector database
The question vector is sent to the vector database, which returns the top-k most semantically similar chunks from your document store. Top-k refers to how many chunks you retrieve — typically 3 to 5 is a good starting point. Too few and you may miss relevant context; too many and you flood the AI with information, increasing cost and potentially confusing the answer.
Many production systems also apply metadata filters at this stage — for example, only searching documents tagged for a specific product line, date range, or department. This improves precision and keeps answers relevant to the user's context.
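The retrieval step, top-k similarity search with an optional metadata filter, can be sketched in plain Python. This is a toy linear scan for illustration; real vector databases use approximate nearest-neighbor indexes to get the same result across millions of entries:

```python
import math

def top_k_search(query_vec, entries, k=3, where=None):
    """Rank stored chunks by cosine similarity to the query vector,
    optionally keeping only entries whose metadata matches `where`."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a))
                      * math.sqrt(sum(x * x for x in b)))

    candidates = [
        e for e in entries
        if where is None
        or all(e["metadata"].get(f) == v for f, v in where.items())
    ]
    ranked = sorted(candidates, key=lambda e: cosine(query_vec, e["vector"]),
                    reverse=True)
    return ranked[:k]
```

Tuning `k` is the trade-off described above: higher values catch answers spread across sections, lower values keep the prompt focused and cheap.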
Step 7: Build the prompt
The retrieved chunks and the user's original question are assembled into a single prompt that gets sent to the language model. The structure of this prompt matters more than most beginners expect. A well-designed prompt instructs the model to answer only from the provided context, to acknowledge when the answer is not in the documents, and optionally to cite which passage it drew from.
A solid baseline prompt template: "You are a helpful assistant. Answer the question below using only the provided context. If the answer is not present in the context, say so clearly — do not guess. Context: [retrieved chunks go here] Question: [user question goes here]"
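Assembling that template is straightforward string formatting. The sketch below tags each chunk with its source so the model can cite passages; the `(text, source)` pair shape is an assumption about how your retrieval step returns results:

```python
def build_prompt(question, chunks):
    """Assemble the baseline RAG prompt.
    `chunks` is a list of (text, source) pairs from the retrieval step."""
    context = "\n\n".join(
        f"[Source: {source}]\n{text}" for text, source in chunks
    )
    return (
        "You are a helpful assistant. Answer the question below using only "
        "the provided context. If the answer is not present in the context, "
        "say so clearly - do not guess.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Labeling each chunk with its source in the context is what makes citation instructions ("mention which source you used") actually workable for the model.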
Step 8: Generate and return the answer
The LLM reads the assembled prompt and generates a response grounded in the retrieved passages. Because the relevant information was explicitly provided, the model does not need to rely on its training memory — significantly reducing the chance of a fabricated or outdated answer.
For user-facing applications, it is good practice to surface the source documents alongside the answer — showing which files or sections the answer came from. This builds trust and lets users verify the information themselves.
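The final step ties generation and source attribution together. In the sketch below, `llm` is a placeholder for whatever model call you use (OpenAI, a local model, etc.) - any callable that takes a prompt string and returns text:

```python
def answer_with_sources(question, retrieved, llm):
    """Generate a grounded answer and return it with its source list.
    `retrieved` is a list of (text, source) pairs; `llm` is a stand-in
    callable for your actual model API call."""
    context = "\n\n".join(text for text, _source in retrieved)
    prompt = (
        "Answer only from the context below. If the answer is not there, "
        "say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    answer = llm(prompt)
    sources = sorted({source for _text, source in retrieved})
    return {"answer": answer, "sources": sources}
```

Returning the deduplicated source list alongside the answer is what lets the UI show "based on: policy-handbook.pdf, faq.md" under each response.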
Choosing your framework
Rather than building each of these steps from scratch, most developers use a RAG framework that connects the pieces together. The two most widely adopted options are LangChain and LlamaIndex. Both handle document loading, chunking, embedding, vector database integration, and prompt assembly out of the box.
| | LangChain | LlamaIndex |
|---|---|---|
| Best for | Flexible pipelines, agents, multi-step workflows | Document Q&A and search-focused applications |
| Learning curve | Moderate — more concepts to learn upfront | Gentler — focused API, easier to get started |
| Ecosystem | Very large, many integrations | Large, strong focus on data connectors |
| When to pick it | You need more than just RAG (agents, tools, chains) | Your primary goal is querying over documents |
Common mistakes to avoid
- Chunks that are too large. A 2,000-word chunk embeds into one vector that represents too many ideas at once. Retrieval becomes imprecise. Keep chunks focused — 300 to 500 words is a reliable range.
- No overlap between chunks. A sentence split across two consecutive chunks loses context in both. Always use a small overlap (50–100 words) between adjacent chunks.
- Mismatched embedding models. Using one model to index and a different one at query time produces incomparable vectors. Pick one model and use it everywhere.
- Retrieving too few results. If the answer spans multiple sections of a document, retrieving only one chunk may miss half the picture. Start with top-k=5.
- No metadata on stored chunks. Without source metadata, you cannot filter searches, attribute answers, or audit where information came from. Always store the document name, section, and date alongside each chunk.
- Forgetting to re-index when documents change. RAG is only as current as your vector store. Build a process to re-index whenever source documents are updated or added.
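One lightweight way to build that re-indexing process is content fingerprinting: hash each document at indexing time, and on each sync compare current hashes against the stored ones. This is a generic sketch, not tied to any framework:

```python
import hashlib

def content_fingerprint(text):
    """Stable fingerprint of a document's content; if it changes,
    the document needs re-chunking and re-embedding."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def documents_to_reindex(current_docs, stored_fingerprints):
    """Compare current documents against fingerprints recorded at the
    last indexing run; return the names that are new or have changed.
    `current_docs` maps document name -> full text."""
    return [
        name for name, text in current_docs.items()
        if stored_fingerprints.get(name) != content_fingerprint(text)
    ]
```

Storing the fingerprint in each chunk's metadata also makes it easy to delete a changed document's stale chunks before inserting the new ones.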
How to evaluate whether your RAG system is working
Testing RAG quality is often overlooked by beginners, but it is what separates a proof-of-concept from a reliable production system. The good news is you do not need sophisticated tooling to start — a simple evaluation set goes a long way.
Build a list of 20 to 30 questions that you know the answers to from your documents. Run each question through your pipeline and check three things: did the system retrieve the right chunks, did the LLM use them correctly, and is the final answer accurate? This manual spot-check will quickly reveal whether your chunking strategy, top-k value, or prompt design needs adjustment.
| What to Check | What It Tells You | How to Fix It |
|---|---|---|
| Retrieved chunks are irrelevant | Embedding model or chunk size may be mismatched to your content | Try a different embedding model or reduce chunk size |
| Right chunks retrieved but wrong answer | Prompt design is not grounding the model properly | Tighten the system prompt to restrict the model to the provided context |
| Answer is correct but no source cited | The model is not attributing its response | Add citation instructions to your prompt template |
| System says it does not know when it should | Retrieval is failing to surface the relevant chunk | Increase top-k, review chunking boundaries, check metadata filters |
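The manual spot-check described above can be semi-automated with a tiny evaluation loop. The `retrieve` callable and its `question -> list of (text, source) pairs` signature are assumptions about your pipeline; adapt them to whatever your retrieval step actually returns:

```python
def evaluate_retrieval(eval_set, retrieve):
    """Spot-check retrieval quality: for each (question, expected_source)
    pair, check whether the expected document appears among the retrieved
    chunks. Returns the hit rate and the failing cases to inspect."""
    hits = 0
    failures = []
    for question, expected_source in eval_set:
        sources = [source for _text, source in retrieve(question)]
        if expected_source in sources:
            hits += 1
        else:
            failures.append((question, expected_source, sources))
    return {"hit_rate": hits / len(eval_set), "failures": failures}
```

This only measures the retrieval half of the pipeline, but that is deliberate: if the right chunks never reach the prompt, no amount of prompt tuning will fix the answers, so check retrieval first.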
A note on production readiness
The architecture described here is the standard RAG baseline — and it is a solid foundation. As your system grows, you will encounter refinements like hybrid search (combining semantic and keyword retrieval), re-ranking retrieved results for better precision, query rewriting to handle ambiguous user questions, and caching frequent queries to reduce latency and cost. These are not beginner concerns, but it is useful to know they exist as natural next steps.
For most first projects, the baseline pipeline described in this guide will take you far. Start simple, evaluate honestly, and improve incrementally based on what your test questions reveal.
Summary: the full pipeline at a glance
- Load your source documents (PDFs, databases, web pages, internal files).
- Chunk them into focused 300–500 word passages with 50-word overlap.
- Embed each chunk using a consistent embedding model.
- Store vectors and metadata in a vector database (ChromaDB to start, Pinecone for production).
- Embed the user's question using the same embedding model.
- Retrieve the top 3–5 most semantically similar chunks from the database.
- Build a prompt combining the retrieved chunks and the user's question.
- Generate a grounded answer using an LLM, and surface the source documents to the user.
New to the concept of RAG entirely? Start with our companion article What is RAG in AI? A Simple Explanation for the business-level overview before diving into this guide.