The Core Idea
Instead of asking an LLM to answer from its training data alone, retrieval-augmented generation (RAG) first retrieves relevant documents from your knowledge base, then passes them to the LLM as context for the answer.
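To make the idea concrete, here is a minimal sketch of the final step, where retrieved text is placed in front of the question. The function name `build_prompt` and the instruction wording are illustrative assumptions, not a fixed template.

```python
def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble a grounded prompt: numbered context passages, then the question.

    Hypothetical helper for illustration; the instruction text is an assumption.
    """
    # Number each passage so the model (and the reader) can refer back to it.
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

The returned string would then be sent to whichever LLM you use. Numbering the passages also makes it easier to ask the model to cite which passage supports its answer.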
Stage 1: Ingestion and Chunking
Source documents are loaded and split into chunks. Chunk size matters: chunks that are too small lose surrounding context, while chunks that are too large dilute the relevant signal. We typically use 400–800 tokens per chunk with 10–20% overlap between consecutive chunks.
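A sliding window is one simple way to get overlapping chunks. The sketch below assumes the text has already been tokenized into a list. The function name `chunk_tokens` and the defaults (600 tokens, 90 tokens of overlap, which is 15%) are illustrative choices within the ranges above.

```python
def chunk_tokens(
    tokens: list[str], chunk_size: int = 600, overlap: int = 90
) -> list[list[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens.

    Hypothetical helper; in practice the tokens would come from the tokenizer
    that matches your embedding model rather than from whitespace splitting.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a trailing chunk
        # made up only of tokens the previous chunk already covers.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

For a quick experiment, `chunk_tokens(text.split())` treats whitespace-separated words as tokens. Production pipelines often also respect sentence or section boundaries instead of cutting at a fixed count.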
Stage 2: Embedding
Each chunk is converted to a vector embedding. We use Azure OpenAI's text-embedding-3-large (3,072 dimensions) for production systems.
Stage 3: Vector Storage
Embeddings are stored in a vector database: Pinecone for cloud-managed simplicity, pgvector for PostgreSQL integration, or Qdrant for self-hosted control.
Stage 4: Retrieval
At query time, the user's question is embedded with the same model. A similarity search finds the top-k most semantically similar chunks.
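The search itself can be sketched in plain Python, assuming cosine similarity as the metric (a common choice, though the right metric depends on the embedding model and the database configuration). The names `cosine_similarity` and `top_k` are illustrative. A vector database does the same ranking with an index instead of a full scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 3
) -> list[tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first.

    Brute-force scan for illustration only; `chunk_vecs` maps a chunk id to
    its embedding.
    """
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The query vector must come from the same embedding model as the stored chunks. Vectors from different models do not share a space, so their similarities are meaningless.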
Stage 5: Reranking
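A cross-encoder reranker scores each retrieved chunk against the specific query, reading the query and chunk together rather than comparing precomputed embeddings. This two-stage approach lets the first-stage search cast a wide net for recall while the reranker keeps only the most relevant chunks, improving precision.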
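The two-stage shape can be sketched as follows. The scorer is passed in as a function, so a real cross-encoder model could be plugged in. The `toy_overlap_score` function is only a stand-in based on word overlap, not a cross-encoder, and all names here are hypothetical.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Re-order first-stage candidates by a (query, passage) relevance score."""
    scored = [(score_fn(query, passage), passage) for passage in candidates]
    # Sort on the score only; Python's sort is stable, so ties keep the
    # order produced by the first-stage retriever.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_n]]


def toy_overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer: fraction of query words that appear in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & passage_words) / len(query_words)
```

A typical setup retrieves a larger candidate set in Stage 4 (for example a few dozen chunks) and keeps only a handful after reranking. The cross-encoder is much slower per pair than a vector comparison, which is why it runs on the shortlist only.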
Building a RAG system? We've shipped them across healthcare, legal, and enterprise.