
Semantic Caching for LLM Applications: How I Cut Our API Bill by 60%

40% of our LLM calls were near-duplicates — same question, different words, full price. Semantic caching with pgvector dropped the monthly bill from $215 to $85.

AI · Caching · Cost Optimization · pgvector

I was reviewing our AI billing dashboard at Lit Alerts when I noticed something strange. Forty percent of our LLM calls were near-duplicates. Users were asking the same questions with slightly different phrasing — "What are the reporting requirements?" vs. "What reports do I need to file?" vs. "Tell me about reporting obligations." Each one triggered a fresh API call. Same context, same answer, different words, full price every time.

That was when I implemented semantic caching. The cache hit rate settled at 62%, and our monthly API bill dropped by roughly 60%.

Why Traditional Caching Fails for LLMs

If you have ever built a web application, you know how caching works. You hash the request, store the response, and serve it from cache when the same request comes in. For APIs with deterministic inputs, this works perfectly.

LLMs break this model because semantically identical questions have different string representations. A traditional cache keyed on the exact prompt string would miss nearly every cache opportunity, because users never ask the same question the same way twice.

Semantic caching solves this by comparing the meaning of queries rather than their exact text. If a new query is semantically similar enough to a cached query, you serve the cached response.
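As a toy illustration of what "semantically similar enough" means, here is the comparison that does the work: cosine similarity between embedding vectors. The vectors below are made up and three-dimensional for readability; real embeddings from text-embedding-3-small have 1,536 dimensions.

```typescript
// Cosine similarity between two embedding vectors.
// Values near 1.0 mean the underlying texts are close in meaning.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical embeddings: two rephrasings of the same question
// land close together in vector space.
const reportingReqs = [0.9, 0.3, 0.1];
const reportsToFile = [0.8, 0.5, 0.1];
console.log(cosineSimilarity(reportingReqs, reportsToFile).toFixed(2)); // → 0.97
```

A score of 0.97 would clear a 0.92 threshold, so the second query would be served from cache.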

The Architecture

A semantic cache has three components: an embedding model to convert queries to vectors, a vector store to find similar cached queries, and a similarity threshold to decide when a cache hit is "close enough."

interface CachedResponse {
  query: string;
  embedding: number[];
  response: string;
  model: string;
  createdAt: Date;
  hitCount: number;
}

class SemanticCache {
  private similarityThreshold: number;
  private ttlHours: number;

  // Assumes a pg client (`db`) and an OpenAI SDK client (`openai`) in scope.
  constructor(threshold = 0.92, ttlHours = 24) {
    this.similarityThreshold = threshold;
    this.ttlHours = ttlHours;
  }

  async get(query: string): Promise<string | null> {
    const queryEmbedding = await this.embed(query);

    const result = await db.query(
      `SELECT query, response, 1 - (embedding <=> $1::vector) AS similarity
       FROM llm_cache
       WHERE 1 - (embedding <=> $1::vector) > $2
         AND created_at > now() - $3 * interval '1 hour'
       ORDER BY embedding <=> $1::vector
       LIMIT 1`,
      [JSON.stringify(queryEmbedding), this.similarityThreshold, this.ttlHours]
    );

    if (result.rows.length > 0) {
      // Update hit count for analytics
      await db.query(
        `UPDATE llm_cache SET hit_count = hit_count + 1 WHERE query = $1`,
        [result.rows[0].query]
      );
      return result.rows[0].response;
    }

    return null;
  }

  async set(query: string, response: string, model: string): Promise<void> {
    const embedding = await this.embed(query);

    await db.query(
      `INSERT INTO llm_cache (query, embedding, response, model, created_at, hit_count)
       VALUES ($1, $2, $3, $4, now(), 0)
       ON CONFLICT (query) DO UPDATE SET response = $3, created_at = now()`,
      [query, JSON.stringify(embedding), response, model]
    );
  }

  private async embed(text: string): Promise<number[]> {
    const result = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: text,
    });
    return result.data[0].embedding;
  }
}

I use pgvector for the vector store because — as I wrote about in my RAG pipeline post — it means one fewer service to manage. The llm_cache table lives in the same PostgreSQL database as everything else.

Choosing the Right Similarity Threshold

The threshold is the most important parameter, and getting it wrong in either direction is painful.

Too high (> 0.95): You rarely get cache hits. Most queries that are semantically identical score between 0.88 and 0.95 because they use different words. You are paying for embeddings without getting cache savings.

Too low (< 0.85): You serve stale or incorrect responses. "What are the reporting requirements for Oregon?" and "What are the licensing requirements for Oregon?" score around 0.87 — similar enough to pass a low threshold, but different enough to need different answers.

I started at 0.90 and tuned over two weeks of production traffic. The sweet spot for Complai's compliance queries was 0.92. For more general chatbot use cases, 0.88–0.90 tends to work well.

// Log near-misses for threshold tuning
async function logNearMiss(
  query: string,
  closestMatch: string,
  similarity: number,
  threshold: number
) {
  if (similarity > 0.80 && similarity < threshold) {
    await db.query(
      `INSERT INTO cache_near_misses (query, closest_match, similarity, created_at)
       VALUES ($1, $2, $3, now())`,
      [query, closestMatch, similarity]
    );
  }
}

Reviewing near-misses weekly is how you tune the threshold. If you see pairs that should have been cache hits, lower the threshold. If you see pairs that are clearly different questions, keep it where it is.
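The weekly review itself can be a single query. This is a sketch assuming the `cache_near_misses` table from the logging snippet above; the pairs closest to the threshold are the most informative ones to eyeball:

```sql
-- Weekly threshold review: near-misses sorted by how close they came
SELECT query, closest_match, similarity
FROM cache_near_misses
WHERE created_at > now() - interval '7 days'
ORDER BY similarity DESC
LIMIT 50;
```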

Cache Invalidation: The Hard Part

Semantic caching introduces a new invalidation challenge. In a traditional cache, you invalidate when the underlying data changes. In a semantic cache, you also need to invalidate when:

  1. The knowledge base changes. If you update your documents, cached answers based on old documents become stale.
  2. The model changes. If you upgrade from GPT-4o to a newer model, cached responses from the old model may be lower quality.
  3. Time-sensitive information expires. "What are the current fees?" has a different answer in January vs. July.

My approach is aggressive: I set a TTL (time-to-live) on every cached response and clear the entire cache whenever the knowledge base is updated.

// Cache-aside wrapper: serve from cache when possible,
// otherwise generate and store for next time
async function getCachedOrGenerate(
  query: string,
  generateFn: () => Promise<string>
): Promise<{ response: string; fromCache: boolean }> {
  const cached = await cache.get(query);

  if (cached) {
    return { response: cached, fromCache: true };
  }

  const response = await generateFn();
  await cache.set(query, response, 'gpt-4o');

  return { response, fromCache: false };
}

// Nuclear invalidation on knowledge base update
async function onKnowledgeBaseUpdate(): Promise<void> {
  await db.query(`DELETE FROM llm_cache`);
  console.log('Cache cleared after knowledge base update');
}

This is deliberately simple. More sophisticated approaches — like invalidating only cached responses that reference changed documents — add complexity that was not worth it for our use case. When in doubt, clear the cache. The cost of regenerating a few responses is nothing compared to the cost of serving stale compliance advice.
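A small companion to the nuclear option is a periodic sweep that enforces the TTL at the storage layer, so expired rows do not accumulate. This is a sketch assuming a 24-hour TTL and a cron-style scheduler:

```sql
-- Periodic TTL sweep (e.g. hourly from a cron job):
-- delete cached responses older than the TTL
DELETE FROM llm_cache
WHERE created_at < now() - interval '24 hours';
```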

The Economics

Let me break down the actual numbers from Lit Alerts.

Before semantic caching:

  • ~2,400 LLM calls per day
  • Average 1,200 tokens per call
  • Monthly token consumption: ~86 million tokens
  • Monthly cost at $2.50/M tokens: ~$215

After semantic caching (62% hit rate):

  • ~2,400 user queries per day
  • ~912 actual LLM calls (62% served from cache)
  • ~2,400 embedding calls per day for cache lookups — one per query (cheap)
  • Monthly token consumption: ~33 million tokens
  • Monthly cost: ~$82 + ~$3 for embeddings = ~$85

That is a $130/month savings for a relatively small application. For enterprise deployments processing millions of queries, the savings scale linearly.

The embedding cost for cache lookups is negligible. Text-embedding-3-small costs $0.02 per million tokens. Even at high volume, the embedding cost is a rounding error compared to the LLM cost savings.

When NOT to Cache

Semantic caching is not appropriate for every LLM interaction:

  • Conversations with context: If the response depends on the full conversation history, caching by query alone will produce incorrect results. You would need to include the conversation context in the cache key, which defeats the purpose.
  • Creative generation: If the user wants novel output every time (marketing copy, brainstorming), caching the same response defeats the purpose.
  • Real-time data: If the response depends on data that changes frequently (stock prices, live dashboards), a stale cache is worse than no cache.
  • Personalized responses: If the answer depends on user-specific data, you need per-user caches, which dramatically reduce hit rates.

For factual Q&A over a relatively stable knowledge base — which is exactly what Complai and Lit Alerts do — semantic caching is one of the highest-ROI optimizations you can implement.

Getting Started

If you are running any LLM-powered feature with repetitive queries, start here:

  1. Log your queries for a week. Look at the distribution. If you see clusters of similar questions, semantic caching will save you money.
  2. Add a cache table with pgvector. It is five lines of SQL.
  3. Set a conservative threshold (0.92+). You can always lower it after reviewing near-misses.
  4. Set a short TTL (24 hours). Better to miss a few cache opportunities than to serve stale data.
  5. Track your hit rate. If it is below 30%, your queries might be too diverse for semantic caching. If it is above 70%, you were leaving serious money on the table before you implemented it.
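The cache table from step 2, sketched as a migration. Column names match the snippets above; the HNSW index is an assumption — at small scale, a sequential scan over a plain table is fine:

```sql
-- Requires the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE llm_cache (
  query      text PRIMARY KEY,
  embedding  vector(1536),  -- text-embedding-3-small dimension
  response   text NOT NULL,
  model      text NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now(),
  hit_count  integer NOT NULL DEFAULT 0
);

-- Approximate nearest-neighbor index for cosine distance (optional at low volume)
CREATE INDEX ON llm_cache USING hnsw (embedding vector_cosine_ops);
```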

The implementation took me a day. The payback period was three days. That is the kind of optimization I wish I had known about six months earlier.
