EmbedRoute

How to Build a RAG Pipeline: A Step-by-Step Guide

Learn how to build a Retrieval-Augmented Generation (RAG) pipeline from scratch. Covers chunking, embedding, vector storage, and retrieval with code examples.

What is RAG?

Retrieval-Augmented Generation (RAG) is a technique that improves LLM responses by feeding them relevant context from your own data. Instead of relying solely on the model's training data, RAG retrieves specific documents or passages that are relevant to the user's query.

The result: more accurate, up-to-date, and grounded responses.

Why RAG Matters

LLMs have two fundamental limitations:

1. Knowledge cutoff — They don't know about events after their training date
2. Hallucinations — They sometimes generate plausible but incorrect information

RAG solves both problems by grounding the LLM's responses in your actual data.

The RAG Pipeline

A typical RAG pipeline has four stages:

1. Chunking — Split your documents into smaller pieces
2. Embedding — Convert chunks into vector representations
3. Storage — Store vectors in a database for fast retrieval
4. Retrieval + Generation — Find relevant chunks and feed them to your LLM

Let's build each stage.

Step 1: Chunking Your Documents

The quality of your chunks directly impacts retrieval quality. Here are the key decisions:

Chunk size: 200-500 tokens is the sweet spot for most use cases. Too small and you lose context. Too large and you dilute the signal.

Overlap: Add 50-100 tokens of overlap between chunks so you don't lose information at boundaries.
function chunkText(text, maxTokens = 400, overlap = 50) {
  // Splits on spaces, using word count as a rough proxy for token count
  const words = text.split(' ')
  const chunks = []
  let start = 0

  while (start < words.length) {
    const end = Math.min(start + maxTokens, words.length)
    chunks.push(words.slice(start, end).join(' '))
    if (end === words.length) break // without this, the overlap step loops forever on the last chunk
    start = end - overlap
  }

  return chunks
}

Pro tip: Chunk by semantic boundaries (paragraphs, sections) when possible, not just token count.
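A minimal sketch of that tip: pack whole paragraphs into chunks instead of cutting at an arbitrary word count. The function name `chunkByParagraphs` and the word-count heuristic are illustrative, not a fixed API.

```javascript
// Split on blank lines (paragraph boundaries), then pack paragraphs into
// chunks of roughly maxWords words without splitting a paragraph in half.
function chunkByParagraphs(text, maxWords = 400) {
  const paragraphs = text.split(/\n\s*\n/).filter(p => p.trim().length > 0)
  const chunks = []
  let current = []
  let currentWords = 0

  for (const para of paragraphs) {
    const words = para.trim().split(/\s+/).length
    if (currentWords + words > maxWords && current.length > 0) {
      chunks.push(current.join('\n\n'))
      current = []
      currentWords = 0
    }
    current.push(para.trim())
    currentWords += words
  }
  if (current.length > 0) chunks.push(current.join('\n\n'))
  return chunks
}
```

A single paragraph longer than maxWords still becomes its own oversized chunk here; in practice you would fall back to token-count splitting for those.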

Step 2: Embedding Your Chunks

Each chunk needs to be converted into a vector. Choose an embedding model based on your content type:

  • General text: OpenAI text-embedding-3-small ($0.02/1M tokens)
  • Code: Voyage voyage-code-3 ($0.06/1M tokens)
  • Multilingual: Cohere embed-multilingual-v3.0 ($0.10/1M tokens)
// Using EmbedRoute (OpenAI-compatible)
const response = await fetch('https://api.embedroute.com/v1/embeddings', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer er_your_key',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'openai/text-embedding-3-small',
    input: chunks,
  }),
})

const { data } = await response.json()
const vectors = data.map(d => d.embedding)

Batch your requests. Most APIs accept arrays of strings. Sending 100 chunks in one call is much faster than 100 individual calls.
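One way to batch, sketched with a generic splitting helper. The EmbedRoute call mirrors the example above; `embedAll` is an illustrative name, not part of any SDK, and API batch limits vary by provider.

```javascript
// Split an array into batches of at most `size` items.
function toBatches(items, size = 100) {
  const batches = []
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size))
  }
  return batches
}

// Illustrative: one API call per batch instead of one call per chunk.
async function embedAll(chunks, apiKey) {
  const vectors = []
  for (const batch of toBatches(chunks, 100)) {
    const response = await fetch('https://api.embedroute.com/v1/embeddings', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${apiKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        model: 'openai/text-embedding-3-small',
        input: batch,
      }),
    })
    const { data } = await response.json()
    vectors.push(...data.map(d => d.embedding))
  }
  return vectors
}
```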

Step 3: Storing Vectors

You need a vector database for fast similarity search. Popular options:

  • pgvector: best if you're already using PostgreSQL; setup is an extension install
  • Pinecone: managed, zero ops; setup is a cloud signup
  • Qdrant: self-hosted and fast; setup is Docker or cloud
  • Chroma: good for prototypes; setup is a pip install

Here's a pgvector example:

CREATE EXTENSION vector;

CREATE TABLE documents (
  id SERIAL PRIMARY KEY,
  content TEXT,
  embedding vector(1536)
);

CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops);
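Populating the table from JavaScript takes one small wrinkle: pgvector expects vectors as a '[0.1,0.2,...]' string literal, not a native array. A minimal sketch, assuming a node-postgres `db` client and the `chunks`/`vectors` arrays from the embedding step:

```javascript
// Format a JS number array as a pgvector literal: [0.1, 0.2] -> '[0.1,0.2]'
function toVectorLiteral(embedding) {
  return '[' + embedding.join(',') + ']'
}

// Illustrative insert loop (db is assumed to be a node-postgres client):
// for (let i = 0; i < chunks.length; i++) {
//   await db.query(
//     'INSERT INTO documents (content, embedding) VALUES ($1, $2::vector)',
//     [chunks[i], toVectorLiteral(vectors[i])]
//   )
// }
```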

Step 4: Retrieval and Generation

When a user asks a question:

1. Embed the query using the same model
2. Find the top-k most similar chunks
3. Include them in the LLM prompt

// 1. Embed the query
const queryEmbedding = await getEmbedding(userQuery)

// 2. Find similar chunks (pgvector example)
// pgvector expects a '[...]' literal, so serialize the array before binding it
const results = await db.query(
  'SELECT content FROM documents ORDER BY embedding <=> $1::vector LIMIT 5',
  [JSON.stringify(queryEmbedding)]
)

// 3. Build the prompt
const context = results.rows.map(r => r.content).join('\n\n')
const prompt = `Answer based on this context:\n\n${context}\n\nQuestion: ${userQuery}`

// 4. Generate response
const response = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [{ role: 'user', content: prompt }],
})

Common Mistakes

1. Bad chunking. This is the #1 source of poor RAG quality. If your chunks don't contain coherent information, no embedding model will save you.
2. Wrong embedding model. Using a general model for code search, or a monolingual model for multilingual content. Test multiple models on your actual data.
3. Too few or too many chunks. Retrieving 1 chunk might miss context. Retrieving 20 overwhelms the LLM. Start with 3-5 and adjust.
4. No evaluation. You need a test set of questions and expected answers to measure retrieval quality. Without this, you're flying blind.
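For mistake #4, even a tiny harness beats nothing. A minimal sketch, assuming you can label which stored chunk ids should come back for each test question; the shapes of `testSet` and `retrieve` here are illustrative:

```javascript
// hitRate@k: fraction of test questions where at least one expected
// chunk id appears among the top-k retrieved ids.
function hitRateAtK(testSet, retrieve, k = 5) {
  let hits = 0
  for (const { query, expectedIds } of testSet) {
    const retrievedIds = retrieve(query, k)
    if (retrievedIds.some(id => expectedIds.includes(id))) hits++
  }
  return hits / testSet.length
}
```

Run it whenever you change the chunker, embedding model, or k, and compare the numbers on the same test set.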

Next Steps

  • Test different embedding models through EmbedRoute to find what works best for your data
  • Experiment with chunk sizes and overlap
  • Add metadata filtering (date, category, source) to improve relevance
  • Consider reranking retrieved chunks before sending to the LLM
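A simple way to start on the reranking bullet is to re-score the retrieved chunks against the query embedding and keep only the best few for the prompt. A dedicated reranker model usually works better; this cosine-similarity version is just a sketch, assuming each retrieved chunk carries its `embedding`:

```javascript
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Re-order retrieved chunks by similarity to the query embedding,
// then keep the top n for the prompt.
function rerank(queryEmbedding, chunks, n = 3) {
  return chunks
    .map(c => ({ ...c, score: cosineSimilarity(queryEmbedding, c.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, n)
}
```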

The embedding step is where most RAG pipelines succeed or fail. Choosing the right model for your content type makes a measurable difference in retrieval quality.

Ready to try multiple embedding models?

Access OpenAI, Voyage, Cohere, and more with a single API.

Join the Waitlist