
Building a RAG Pipeline with pgvector and TypeScript

How I built a production retrieval-augmented generation pipeline using PostgreSQL and pgvector instead of a dedicated vector database — from embedding documents to semantic search.

RAG, PostgreSQL, pgvector, TypeScript, AI

When I started building the compliance search engine for Complai, every tutorial pointed me toward a dedicated vector database — Pinecone, Weaviate, Qdrant. The pitch was compelling: purpose-built for embeddings, blazing fast at scale. But the more I thought about it, the less it made sense for my situation. I already had a PostgreSQL database holding every other piece of data in the application. Adding a second database meant another service to manage, another set of credentials, another network hop, and another monthly bill. Then I discovered pgvector, and everything clicked.

pgvector is a PostgreSQL extension that adds vector similarity search to the database you probably already run. For a RAG pipeline — retrieval-augmented generation, where you fetch relevant context from a knowledge base and feed it to an LLM — it's a pragmatic choice that keeps your stack simple without sacrificing performance.

Here's how I built a production RAG pipeline with pgvector and TypeScript, and what I learned along the way.

Setting Up pgvector

The setup is surprisingly simple. Enable the extension, create a table with a vector column, and add an index. The vector dimension depends on your embedding model — OpenAI's text-embedding-3-small produces 1536-dimensional vectors.

-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create the documents table
CREATE TABLE documents (
  id SERIAL PRIMARY KEY,
  content TEXT NOT NULL,
  source_url TEXT,
  chunk_index INTEGER,
  embedding VECTOR(1536),
  created_at TIMESTAMPTZ DEFAULT now()
);

-- HNSW index for fast approximate nearest neighbor search
CREATE INDEX ON documents
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

I chose HNSW over IVFFlat for the index. HNSW uses more memory but provides faster, more accurate search without needing to train on existing data. For a knowledge base that grows incrementally — which is every RAG system I've built — HNSW is the right default.
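HNSW also has a query-time knob: hnsw.ef_search controls how much of the graph each search explores, and pgvector's default is 40. A tiny helper for building the session-level statement, as a sketch to run through whatever query client you use:

```typescript
// Build the session-level statement that widens the HNSW search beam.
// pgvector's hnsw.ef_search defaults to 40; higher values trade latency
// for recall at query time.
function efSearchStatement(efSearch: number): string {
  return `SET hnsw.ef_search = ${efSearch}`
}

efSearchStatement(100) // 'SET hnsw.ef_search = 100'
```

Issue it once per session (or per pooled connection) before running similarity queries that need higher recall.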

The Embedding Pipeline

The core of any RAG system is the ingestion pipeline: take raw documents, split them into chunks, generate embeddings, and store everything in the database. The chunking strategy matters more than most people realize.

import { openai } from './providers'
import { db } from './database'

interface DocumentChunk {
  content: string
  sourceUrl: string
  chunkIndex: number
}

function chunkText(text: string, maxTokens = 500, overlap = 100): string[] {
  // Split on sentence boundaries, keeping the terminal punctuation
  const sentences = text.split(/(?<=[.!?])\s+/)
  const chunks: string[] = []
  let current = ''

  for (const sentence of sentences) {
    // ~4 characters per token is a rough heuristic for English text
    if ((current + sentence).length > maxTokens * 4) {
      chunks.push(current.trim())
      // Carry the last `overlap` words forward for context continuity
      const words = current.split(' ')
      current = words.slice(-overlap).join(' ') + ' ' + sentence
    } else {
      current += ' ' + sentence
    }
  }
  if (current.trim()) chunks.push(current.trim())
  return chunks
}

async function ingestDocument(text: string, sourceUrl: string): Promise<void> {
  const chunks = chunkText(text)

  for (let i = 0; i < chunks.length; i++) {
    const embedding = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: chunks[i],
    })

    await db.query(
      `INSERT INTO documents (content, source_url, chunk_index, embedding)
       VALUES ($1, $2, $3, $4)`,
      [chunks[i], sourceUrl, i, embedding.data[0].embedding]
    )
  }
}

The single biggest factor in RAG quality isn't the LLM or the embedding model — it's the chunking strategy. Split on sentence boundaries with overlap, not arbitrary character counts. Your retrieval accuracy will improve immediately.
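The sentence-boundary regex from chunkText above, shown on a small sample:

```typescript
// The lookbehind keeps the terminal punctuation attached to each sentence
// instead of discarding it the way a plain split on punctuation would.
const sentences = 'Rules apply. Do they? Yes!'.split(/(?<=[.!?])\s+/)
// ['Rules apply.', 'Do they?', 'Yes!']
```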

I spent two weeks experimenting with chunk sizes for Complai's regulatory text. The sweet spot was 400–600 tokens per chunk with 20% overlap. Too small and you lose context. Too large and the semantic signal gets diluted in noise.
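Those numbers map back to chunkText's defaults through the same rough heuristic of about 4 characters per English token (approximate, not tokenizer-accurate):

```typescript
// Rough heuristic: ~4 characters per English token.
const CHARS_PER_TOKEN = 4

// Character budget for a target chunk size in tokens
function charBudget(tokens: number): number {
  return tokens * CHARS_PER_TOKEN
}

// Overlap size for a given chunk size and overlap ratio
function overlapTokens(chunkTokens: number, ratio = 0.2): number {
  return Math.round(chunkTokens * ratio)
}

charBudget(500)    // 2000 characters, the limit chunkText enforces
overlapTokens(500) // 100, matching chunkText's default overlap
```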

Semantic Search That Actually Works

With documents embedded and indexed, querying is straightforward. Embed the user's question with the same model, then find the nearest neighbors using pgvector's <=> operator, which computes cosine distance.

async function findRelevantContext(
  question: string,
  limit = 5
): Promise<DocumentChunk[]> {
  const questionEmbedding = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: question,
  })

  // Alias the snake_case columns so rows match the DocumentChunk interface
  const results = await db.query(
    `SELECT content,
            source_url AS "sourceUrl",
            chunk_index AS "chunkIndex",
            1 - (embedding <=> $1::vector) AS similarity
     FROM documents
     WHERE 1 - (embedding <=> $1::vector) > 0.3
     ORDER BY embedding <=> $1::vector
     LIMIT $2`,
    [JSON.stringify(questionEmbedding.data[0].embedding), limit]
  )

  return results.rows
}

The WHERE clause with a similarity threshold is important. Without it, you'll always get results — even if nothing in your knowledge base is remotely relevant. A threshold of 0.3 works well as a starting point for text-embedding-3-small. Tune it based on your data.
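To make the SQL expression concrete, here is cosine similarity computed by hand. pgvector's <=> operator returns cosine *distance*, which the query converts to similarity with `1 - distance`:

```typescript
// Cosine similarity: dot product of the vectors divided by the product
// of their magnitudes. Ranges from -1 (opposite) to 1 (identical direction).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

cosineSimilarity([1, 0], [1, 0]) // 1 (identical direction)
cosineSimilarity([1, 0], [0, 1]) // 0 (orthogonal, unrelated)
```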

Feeding Context to the LLM

The retrieval-to-generation handoff is where many RAG pipelines fall apart. You need to construct a prompt that gives the LLM enough context to answer accurately without blowing your token budget.

async function answerQuestion(question: string): Promise<string> {
  const context = await findRelevantContext(question)

  if (context.length === 0) {
    return "I don't have enough information to answer that question."
  }

  const contextBlock = context
    .map((doc, i) => `[Source ${i + 1}: ${doc.sourceUrl}]\n${doc.content}`)
    .join('\n---\n')

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0.1,
    messages: [
      {
        role: 'system',
        content: `Answer questions based ONLY on the provided context. If the context doesn't contain the answer, say so. Always cite which source you used.`,
      },
      {
        role: 'user',
        content: `Context:\n${contextBlock}\n\nQuestion: ${question}`,
      },
    ],
  })

  return response.choices[0].message.content ?? ''
}

Low temperature (0.1) is critical for factual accuracy. You want the model to stick to the provided context, not get creative. I also explicitly instruct the model to say "I don't know" rather than hallucinate — and the empty context check before even calling the LLM saves both tokens and embarrassment.
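For the token-budget side of that handoff, a simple trim before prompting goes a long way. A sketch using the same rough 4-characters-per-token heuristic as chunkText (swap in a real tokenizer for precision):

```typescript
// Keep retrieved chunks, in ranked order, until the rough token budget
// is exhausted. Chunks arrive sorted by similarity, so the best ones
// survive the cut.
function fitToBudget(chunks: string[], maxTokens: number): string[] {
  const maxChars = maxTokens * 4 // ~4 characters per token
  const kept: string[] = []
  let used = 0
  for (const chunk of chunks) {
    if (used + chunk.length > maxChars) break
    kept.push(chunk)
    used += chunk.length
  }
  return kept
}
```

Calling fitToBudget on the retrieved contents before building contextBlock caps the prompt size without re-ranking anything.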

Lessons from Production

After running this pipeline in production for several months, here's what I've learned:

  • Embedding model selection matters less than you think. text-embedding-3-small is cheaper and works nearly as well as text-embedding-3-large for most use cases. Start small, benchmark, and upgrade only if retrieval quality is measurably insufficient.
  • Monitor retrieval quality. Log what chunks get returned for each query. Review the logs weekly. You'll find gaps in your knowledge base that no amount of model tuning can fix.
  • pgvector has limits. At around 5–10 million vectors, query performance starts to degrade unless you partition tables or lean on pgvector's newer iterative index scans. For most applications, you'll never hit this ceiling.
  • Reindex when you change embedding models. If you switch from text-embedding-ada-002 to text-embedding-3-small, every existing embedding is now incompatible. Plan for a full re-embed migration.
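A full re-embed is less painful in batches, since the embeddings endpoint accepts an array of inputs rather than one string at a time. A sketch of that migration; reembedAll and its structural types are illustrative stand-ins for the `openai` and `db` handles used throughout this post:

```typescript
// Minimal structural types standing in for the real client and pool.
type EmbeddingsClient = {
  embeddings: {
    create(args: {
      model: string
      input: string[]
    }): Promise<{ data: { embedding: number[] }[] }>
  }
}
type Db = { query(sql: string, params?: unknown[]): Promise<{ rows: any[] }> }

// Split an array into fixed-size batches
function batches<T>(items: T[], size: number): T[][] {
  const out: T[][] = []
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size))
  }
  return out
}

// Re-embed every stored chunk with the new model, 100 rows per API call
async function reembedAll(openai: EmbeddingsClient, db: Db): Promise<void> {
  const { rows } = await db.query(
    'SELECT id, content FROM documents ORDER BY id'
  )
  for (const batch of batches(rows as { id: number; content: string }[], 100)) {
    const response = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: batch.map((r) => r.content),
    })
    for (let j = 0; j < batch.length; j++) {
      await db.query('UPDATE documents SET embedding = $1 WHERE id = $2', [
        JSON.stringify(response.data[j].embedding),
        batch[j].id,
      ])
    }
  }
}
```

Remember to drop and rebuild the HNSW index afterward if the new model changes the vector dimension, since the column type itself has to change.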

The best RAG system is the one that uses infrastructure you already operate. pgvector means one fewer service to monitor, one fewer vendor to evaluate, and one fewer point of failure in your stack.

Building a RAG pipeline doesn't require a venture-funded vector database. If you're running PostgreSQL — and you probably are — pgvector gives you everything you need to build production-grade semantic search. Keep it simple, invest in your chunking strategy, and monitor what gets retrieved. The rest follows naturally.
