Implementing Semantic Cache-Aside for LLMs with Upstash Vector

Large Language Models (LLMs) have fundamentally changed how we build software, but they bring two significant challenges to production environments: high latency and unpredictable costs. If you are building a RAG (Retrieval-Augmented Generation) application or a customer support bot, you’ve likely noticed that many user queries are repetitive or semantically similar.

In traditional web development, we solve repetition with caching. However, standard key-value caching (like a simple Redis GET/SET on a query string) fails in the world of Natural Language Processing. If User A asks "How do I reset my password?" and User B asks "I forgot my password, how can I change it?", an exact-match cache treats these as two distinct misses, despite them requiring the exact same answer.

This is where Semantic Caching comes in. By using vector similarity search, we can identify queries that mean the same thing—even if the wording is different—and serve a cached response. In this article, we’ll explore the 'Semantic Cache-Aside' pattern using Upstash Vector and Upstash Redis.

The Problem with Exact-Match Caching

Traditional caching relies on deterministic keys. Usually, you hash the input (the prompt) and use that hash as a key in a store like Redis.

    const cacheKey = hash(userPrompt);
    const cachedResponse = await redis.get(cacheKey);

This works perfectly for REST APIs where the URL and parameters are consistent. But LLM prompts are high-dimensional and fuzzy. A single character change, a typo, or a synonym results in a different hash. This leads to a low cache hit rate, forcing your application to hit the LLM provider (OpenAI, Anthropic, etc.) for almost every request. This adds 1–5 seconds of latency per turn and drains your API credits.
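To make the failure mode concrete, here is a minimal sketch using the two password prompts from the introduction. The `fnv1a` function is a dependency-free stand-in I chose for illustration (the `hash` function above is unspecified), and `exactMatchKey` is a hypothetical helper.

```typescript
// Minimal 32-bit FNV-1a hash, used here only as an illustrative stand-in for `hash`.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // multiply by the FNV prime, keep 32 unsigned bits
  }
  return h.toString(16);
}

// Hypothetical helper: builds the exact-match cache key for a prompt.
function exactMatchKey(prompt: string): string {
  return `cache:${fnv1a(prompt)}`;
}

const promptA = "How do I reset my password?";
const promptB = "I forgot my password, how can I change it?";

// Same intent, different bytes: the keys differ, so the second prompt is a cache miss.
const sameKey = exactMatchKey(promptA) === exactMatchKey(promptB);
console.log({ keyA: exactMatchKey(promptA), keyB: exactMatchKey(promptB), sameKey });
```

Any deterministic hash behaves the same way here; the issue is the exact-match lookup, not the choice of hash.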

Enter the Semantic Cache-Aside Pattern

The Semantic Cache-Aside pattern mirrors the traditional cache-aside logic but replaces the "exact match" check with a "similarity search."

The Workflow

  1. Embed the Query: Convert the incoming user prompt into a vector embedding (a list of numbers representing meaning).
  2. Vector Search: Query a vector database (Upstash Vector) to find the most similar previously stored embedding.
  3. Threshold Check: If the similarity score of the top result is above a certain threshold (e.g., 0.95), we consider it a hit.
  4. Fetch & Return: Retrieve the associated response from the cache (Upstash Redis) and return it immediately.
  5. Cache Miss & Update: If no similar query exists, call the LLM, store the response in Redis, and index the embedding in Upstash Vector for future use.
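Steps 2 to 4 can be sketched without any external service. The example below is an in-memory stand-in, not the Upstash implementation: `StoredEntry` and `lookup` are hypothetical names, and a linear scan replaces the vector index purely to show the similarity and threshold logic.

```typescript
// Cosine similarity between two equal-length embeddings, in the range [-1, 1].
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Embedding dimensions must match");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical in-memory record pairing a stored embedding with its cached response.
interface StoredEntry {
  embedding: number[];
  response: string;
}

// Find the most similar entry and return its response only if it clears the threshold.
function lookup(query: number[], entries: StoredEntry[], threshold = 0.95): string | null {
  let best: StoredEntry | null = null;
  let bestScore = -Infinity;
  for (const entry of entries) {
    const score = cosineSimilarity(query, entry.embedding);
    if (score > bestScore) {
      bestScore = score;
      best = entry;
    }
  }
  return best !== null && bestScore >= threshold ? best.response : null;
}
```

A real vector index performs the same comparison with an approximate nearest-neighbor search instead of a full scan, which is what keeps the lookup fast as the cache grows.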

Why Upstash Vector and Redis?

For a senior engineer, the choice of tools comes down to operational overhead and developer experience. Upstash provides serverless versions of both Redis and Vector, which is ideal for this pattern for several reasons:

  • Serverless Scaling: You don't need to manage clusters or worry about pod sizing for your vector index.
  • Low Latency: Upstash Vector is optimized for fast similarity lookups, which is the heart of the semantic cache.
  • Separation of Concerns: While you can store metadata in a vector database, using Redis for the actual cached content allows you to handle larger payloads, set TTLs (Time-To-Live) more easily, and leverage Redis's rich data types.

Implementing the Solution

Let’s look at a practical implementation using Node.js. We will use the @upstash/vector and @upstash/redis clients.

1. Initializing the Clients

    import { Index } from "@upstash/vector";
    import { Redis } from "@upstash/redis";

    const vectorIndex = new Index({
      url: process.env.UPSTASH_VECTOR_REST_URL,
      token: process.env.UPSTASH_VECTOR_REST_TOKEN,
    });

    const redis = new Redis({
      url: process.env.UPSTASH_REDIS_REST_URL,
      token: process.env.UPSTASH_REDIS_REST_TOKEN,
    });

2. The Core Caching Logic

Here is how we orchestrate the semantic check. We assume you have a function getEmbedding(text) that uses an embedding model like text-embedding-3-small.

    import { randomUUID } from "node:crypto";

    // Shape of the metadata stored alongside each vector
    type CacheMetadata = { cacheKey: string; originalPrompt: string };

    async function getSemanticCachedResponse(prompt: string) {
      const queryEmbedding = await getEmbedding(prompt);

      // Search for the top 1 result in Upstash Vector
      const [match] = await vectorIndex.query<CacheMetadata>({
        vector: queryEmbedding,
        topK: 1,
        includeMetadata: true,
      });

      // Threshold check: 0.92 is often a sweet spot for similarity
      if (match?.metadata && match.score > 0.92) {
        // The metadata contains the Redis key where the actual answer is stored
        const cachedData = await redis.get<string>(match.metadata.cacheKey);
        if (cachedData) {
          console.log("Semantic Cache Hit!");
          return cachedData;
        }
        // Otherwise the Redis entry expired while its vector remained: treat as a miss
      }

      console.log("Cache Miss. Querying LLM...");
      const llmResponse = await callLLM(prompt);

      // Save to Redis with a TTL of 24 hours. A random UUID avoids the key
      // collisions that Date.now() can produce under concurrent requests.
      const cacheKey = `cache:${randomUUID()}`;
      await redis.set(cacheKey, llmResponse, { ex: 86400 });

      // Index the embedding in Upstash Vector for future hits
      await vectorIndex.upsert({
        id: cacheKey, // Use the Redis key as the vector ID
        vector: queryEmbedding,
        metadata: { cacheKey, originalPrompt: prompt },
      });

      return llmResponse;
    }
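The `getEmbedding` helper is assumed rather than shown above. One possible sketch calls OpenAI's embeddings endpoint over plain HTTP. The `makeGetEmbedding` factory and the `PostFn` type are hypothetical names introduced here so the HTTP function can be swapped out in tests; in Node 18+ the global `fetch` should fit that shape.

```typescript
// Minimal shape of the HTTP function this sketch needs (an assumption, not a library type).
type PostFn = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// The part of the embeddings response this sketch reads.
interface EmbeddingResponse {
  data: { embedding: number[] }[];
}

// Builds a getEmbedding(text) function around the OpenAI embeddings endpoint.
function makeGetEmbedding(post: PostFn, apiKey: string, model = "text-embedding-3-small") {
  return async function getEmbedding(text: string): Promise<number[]> {
    const res = await post("https://api.openai.com/v1/embeddings", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ model, input: text }),
    });
    if (!res.ok) {
      throw new Error(`Embedding request failed with status ${res.status}`);
    }
    const payload = (await res.json()) as EmbeddingResponse;
    return payload.data[0].embedding;
  };
}

// Possible wiring in an application (not executed here):
//   const getEmbedding = makeGetEmbedding(fetch, process.env.OPENAI_API_KEY ?? "");
```

Injecting the HTTP function keeps the helper testable without network access; using the official OpenAI SDK instead would work just as well.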

Engineering Trade-offs and Nuances

Implementing this in production requires more than just a simple script. As a senior engineer, you need to account for several edge cases.

Choosing the Right Threshold

The similarity threshold is the most critical dial in your system.

  • Too high (0.98+): You get fewer cache hits because the system requires the prompts to be nearly identical.
  • Too low (< 0.85): You risk "Semantic Drift," where the system returns a cached answer for a question that is related but fundamentally different (e.g., returning the answer for "How do I delete my account?" when the user asked "How do I update my account?").

I recommend starting with 0.90 to 0.95 and logging instances where the score falls in the 0.85–0.90 range for manual review.
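The review band described above can be made explicit in code. This is a minimal sketch with hypothetical names (`classifyScore`, `shouldServeFromCache`) and example thresholds taken from the ranges in this section; the logger is injected so you can route near-misses wherever you review them.

```typescript
type CacheDecision = "hit" | "review" | "miss";

interface Thresholds {
  hit: number;         // at or above this score, serve from cache
  reviewFloor: number; // between reviewFloor and hit, log for manual review
}

// Example values only: tune them against your own traffic.
const DEFAULT_THRESHOLDS: Thresholds = { hit: 0.9, reviewFloor: 0.85 };

function classifyScore(score: number, t: Thresholds = DEFAULT_THRESHOLDS): CacheDecision {
  if (score >= t.hit) return "hit";
  if (score >= t.reviewFloor) return "review";
  return "miss";
}

// Returns true when the cached answer should be served.
// Scores in the review band are treated as misses but logged with both prompts.
function shouldServeFromCache(
  score: number,
  prompt: string,
  matchedPrompt: string,
  log: (line: string) => void,
): boolean {
  const decision = classifyScore(score);
  if (decision === "review") {
    log(JSON.stringify({ score, prompt, matchedPrompt }));
  }
  return decision === "hit";
}
```

One caveat worth verifying against the Upstash Vector documentation: as I understand it, scores for cosine indexes are reported normalized to the 0 to 1 range rather than as raw cosine values, so thresholds tuned against raw cosine scores from another library may not transfer directly.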

Handling Context and Parameters

Prompting is rarely just a raw string. It often includes system instructions, temperature settings, and user-specific context. If your LLM's behavior changes based on a user_role parameter, your cache key must reflect that.

You can handle this by including context in the vector metadata or by creating separate namespaces in Upstash Vector for different user groups or application versions.
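One way to keep those parameters from mixing is to derive a scope string from them and use it to partition the cache. The sketch below is illustrative: `CacheContext` and `cacheScope` are hypothetical names, and the choice of fields is an assumption about what changes your model's answers.

```typescript
// Everything that changes the answer for the same prompt (illustrative field set).
interface CacheContext {
  model: string;
  systemPromptVersion: string;
  userRole: string;
  temperature: number;
}

// Builds a stable scope string; entries are only shared within the same scope.
function cacheScope(ctx: CacheContext): string {
  const parts = [ctx.model, ctx.systemPromptVersion, ctx.userRole, `t${ctx.temperature}`];
  return parts
    .map((part) => part.toLowerCase().replace(/[^a-z0-9._-]/g, "_"))
    .join(":");
}

const adminScope = cacheScope({
  model: "gpt-4o",
  systemPromptVersion: "v3",
  userRole: "Admin",
  temperature: 0,
});
const guestScope = cacheScope({
  model: "gpt-4o",
  systemPromptVersion: "v3",
  userRole: "guest",
  temperature: 0,
});
console.log({ adminScope, guestScope });
```

The resulting string could be stored as a `scope` metadata field and matched with a filter, or passed to `vectorIndex.namespace(scope)`; check which characters Upstash allows in namespace names before relying on the `:` separator.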

The Cost of Embedding

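Remember that generating the embedding for the query costs money and time, and you pay it on every request, hit or miss. However, an embedding call is typically far cheaper and faster than a full LLM completion: roughly 100x cheaper and 10x faster is a reasonable rule of thumb, though the exact ratio depends on the models and prompt lengths involved. You are effectively trading a small, fixed cost per request for a large, variable saving on every hit.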
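The trade can be put into numbers. The sketch below works in relative units, taking the rough 100:1 cost ratio mentioned above as an assumption rather than a measured price, and it ignores vector query and storage fees. The function names are hypothetical.

```typescript
interface CostInputs {
  embeddingCost: number; // paid on every request
  llmCost: number;       // paid only on a cache miss
  hitRate: number;       // fraction of requests served from cache, 0 to 1
}

// Expected cost per request with the semantic cache in place.
function expectedCostPerRequest({ embeddingCost, llmCost, hitRate }: CostInputs): number {
  if (hitRate < 0 || hitRate > 1) throw new Error("hitRate must be between 0 and 1");
  return embeddingCost + (1 - hitRate) * llmCost;
}

// Hit rate above which the cache is cheaper than calling the LLM every time.
// Derived from: embeddingCost + (1 - h) * llmCost < llmCost  =>  h > embeddingCost / llmCost
function breakEvenHitRate(embeddingCost: number, llmCost: number): number {
  if (llmCost <= 0) throw new Error("llmCost must be positive");
  return embeddingCost / llmCost;
}

// Relative units: one completion costs 100, one embedding costs 1 (assumed ratio).
const relativeCosts = { embeddingCost: 1, llmCost: 100 };
const costWithCache = expectedCostPerRequest({ ...relativeCosts, hitRate: 0.5 });
const breakEven = breakEvenHitRate(relativeCosts.embeddingCost, relativeCosts.llmCost);
console.log({ costWithCache, costWithoutCache: relativeCosts.llmCost, breakEven });
```

Under that assumed ratio the cache pays for itself once about 1 in 100 requests is a hit, which is why the embedding overhead rarely decides whether semantic caching is worthwhile.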

Advanced Strategy: Hybrid Caching

For high-traffic applications, you can implement a Hybrid Cache strategy:

  1. L1: Exact Match (Redis): Check for a direct hash of the prompt. This is a single key lookup, typically a few milliseconds, and incurs no embedding cost.
  2. L2: Semantic Match (Upstash Vector): If L1 misses, generate the embedding and check the vector index.
  3. L3: LLM Inference: If both miss, hit the model.

This layered approach ensures that exact repetitions are handled with maximum efficiency while still capturing semantic variations.
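The layering can be sketched as an ordered list of lookups with the model as the final fallback. Everything here is a stand-in: `CacheLayer`, `layeredLookup`, and the in-memory `exactStore` are hypothetical, the semantic layer is stubbed to always miss, and write-back to the caches after a model call is omitted for brevity.

```typescript
// One cache layer: resolves to a cached response, or null on a miss.
interface CacheLayer {
  name: string;
  get(prompt: string): Promise<string | null>;
}

interface LayeredResult {
  response: string;
  servedBy: string; // name of the layer that answered, or "llm"
}

// Cheap normalization so trivially different strings share an exact-match key.
function normalizePrompt(prompt: string): string {
  return prompt.trim().toLowerCase().replace(/\s+/g, " ");
}

// Try each layer in order (L1 exact, then L2 semantic); fall back to the model (L3).
async function layeredLookup(
  prompt: string,
  layers: CacheLayer[],
  callModel: (prompt: string) => Promise<string>,
): Promise<LayeredResult> {
  for (const layer of layers) {
    const cached = await layer.get(prompt);
    if (cached !== null) return { response: cached, servedBy: layer.name };
  }
  return { response: await callModel(prompt), servedBy: "llm" };
}

// In-memory stand-ins for Redis (L1) and the vector index (L2).
const exactStore = new Map<string, string>([
  ["how do i reset my password?", "Use the reset link."],
]);
const exactLayer: CacheLayer = {
  name: "exact",
  get: async (p) => exactStore.get(normalizePrompt(p)) ?? null,
};
const semanticLayer: CacheLayer = {
  name: "semantic",
  get: async () => null, // stub: always misses in this sketch
};
```

Recording `servedBy` on every request also gives you the per-layer hit rates for free.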

Security and Privacy Considerations

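When implementing semantic caching, you must be wary of Cache Poisoning and Data Leakage.

  • Cache Poisoning: If an incorrect or maliciously induced response is cached, it is served to every user who later asks a semantically similar question until the entry expires. Short TTLs and a conservative threshold limit the damage.
  • Data Leakage: If User A asks a question containing PII (Personally Identifiable Information) and the LLM response is cached, User B might receive that PII if their question is semantically similar.
  • Solution: Never cache responses that contain user-specific data, or ensure the vector search is scoped to the specific user's ID using Upstash Vector's metadata filtering.

    // Searching with a user filter
    // (assumes userId was stored in each vector's metadata at upsert time)
    const [match] = await vectorIndex.query({
      vector: queryEmbedding,
      topK: 1,
      includeMetadata: true, // needed to read cacheKey from the match
      filter: `userId = '${currentUser.id}'`,
    });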
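Interpolating an ID straight into the filter string is only safe if the ID cannot contain quotes or filter operators. A minimal sketch of a guard follows, assuming IDs are limited to letters, digits, underscores, and hyphens (an assumption about your ID format). `userScopeFilter` is a hypothetical helper, not part of the Upstash client.

```typescript
// Allow-list for user IDs; adjust to match the IDs your system actually issues.
const SAFE_USER_ID = /^[A-Za-z0-9_-]{1,64}$/;

// Builds the metadata filter string, rejecting IDs that could alter the filter expression.
function userScopeFilter(userId: string): string {
  if (!SAFE_USER_ID.test(userId)) {
    throw new Error("Refusing to build a filter from an unexpected user ID format");
  }
  return `userId = '${userId}'`;
}

console.log(userScopeFilter("user_42"));
```

Validating against an allow-list is simpler to reason about than trying to escape every special character, and it fails closed: a malformed ID produces an error instead of an unscoped search.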

Measuring Success

To justify this architecture to stakeholders, you need to track specific metrics:

  1. Cache Hit Rate: The percentage of queries served by the semantic cache.
  2. Latency Reduction: The delta between LLM response time (~2000ms) and Cache response time (~150ms).
  3. Cost Savings: Total tokens saved minus the cost of embedding and vector storage.

In many production environments, we see hit rates between 30% and 60% for common support or FAQ-style queries. Because every hit skips a completion call, this translates to a roughly proportional reduction in LLM spend, less the cost of embeddings and vector storage.
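These metrics are straightforward to compute from per-request logs. The sketch below assumes a hypothetical `RequestRecord` shape; for cache hits, `llmCost` is your estimate of what the completion would have cost. The sample figures reuse the illustrative latencies above and relative cost units, and are not measurements.

```typescript
interface RequestRecord {
  cacheHit: boolean;
  latencyMs: number;
  embeddingCost: number; // paid on every request
  llmCost: number;       // paid on a miss; estimated cost avoided on a hit
}

interface CacheSummary {
  hitRate: number;
  avgHitLatencyMs: number;
  avgMissLatencyMs: number;
  netSavings: number; // LLM cost avoided on hits minus all embedding spend
}

function average(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((sum, v) => sum + v, 0) / values.length;
}

function summarize(records: RequestRecord[]): CacheSummary {
  const hits = records.filter((r) => r.cacheHit);
  const misses = records.filter((r) => !r.cacheHit);
  const avoided = hits.reduce((sum, r) => sum + r.llmCost, 0);
  const embeddingSpend = records.reduce((sum, r) => sum + r.embeddingCost, 0);
  return {
    hitRate: records.length === 0 ? 0 : hits.length / records.length,
    avgHitLatencyMs: average(hits.map((r) => r.latencyMs)),
    avgMissLatencyMs: average(misses.map((r) => r.latencyMs)),
    netSavings: avoided - embeddingSpend,
  };
}

// Illustrative sample: two hits and two misses, in relative cost units.
const sampleRecords: RequestRecord[] = [
  { cacheHit: true, latencyMs: 150, embeddingCost: 1, llmCost: 100 },
  { cacheHit: true, latencyMs: 150, embeddingCost: 1, llmCost: 100 },
  { cacheHit: false, latencyMs: 2000, embeddingCost: 1, llmCost: 100 },
  { cacheHit: false, latencyMs: 2000, embeddingCost: 1, llmCost: 100 },
];
const summary = summarize(sampleRecords);
console.log(summary);
```

Vector storage and query fees are left out of `netSavings` here; subtract them separately if they are material for your workload.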

Conclusion

Semantic Cache-Aside is no longer an optional optimization; it is a necessity for scaling LLM applications sustainably. By combining Upstash Vector for similarity search and Upstash Redis for reliable storage, you can build a system that gets smarter and faster with every query.

Actionable Next Steps:

  1. Identify your most frequent user queries via LLM logs.
  2. Set up an Upstash Vector index with a dimension size matching your embedding model (e.g., 1536 for OpenAI's text-embedding-3-small).
  3. Implement a pilot semantic cache with a conservative threshold (0.95).
  4. Monitor for semantic drift and adjust your threshold based on real-world feedback.