Scaling LLMs: Implementing Semantic Caching with Redis

Building Retrieval-Augmented Generation (RAG) systems has become the standard for bringing proprietary data into Large Language Models (LLMs). However, as these systems move from prototype to production, two familiar enemies emerge: latency and cost. Every call to a high-reasoning model like GPT-4 or Claude 3.5 Sonnet incurs a financial cost per token and a time cost that can range from two to ten seconds.

In traditional web development, we solve this with a cache. If a user requests the same resource twice, we serve it from memory. But LLMs present a unique challenge: in a natural language interface, two queries are rarely identical at the string level, even when they mean exactly the same thing. This is where semantic caching comes in. By using vector similarity search, we can identify when a new query is semantically equivalent to a previous one and serve the cached response. A hit skips the LLM call entirely, so latency drops from seconds to the time of one embedding call plus a vector lookup, and generation token costs for that query drop to zero. The embedding call is still billed, but at a small fraction of the generation cost.

The Failure of Exact-Match Caching

Traditional caching relies on key-value pairs where the key is typically a hash of the input. In a standard REST API, GET /user/123 always maps to the same result. In an AI context, consider these three user prompts:

"How do I reset my password?"
"I forgot my password, what's the process to change it?"
"Password reset instructions, please."

To a human, these are identical. To a standard Redis GET command using the string as a key, these are three distinct misses. If your RAG system processes 10,000 queries a day and 30% of them are variations of common questions, you are overpaying for those 3,000 queries and forcing your users to wait unnecessarily.
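
To make the miss concrete, here is a minimal sketch of how a conventional exact-match cache would key these prompts. The `exact_match_key` helper and the `cache:` prefix are illustrative assumptions, not part of any library. Even with whitespace and case normalized, each phrasing hashes to a different key.

```python
import hashlib

def exact_match_key(prompt: str) -> str:
    """Build a cache key the way a conventional exact-match cache might (hypothetical helper)."""
    normalized = prompt.strip().lower()
    return "cache:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()

prompts = [
    "How do I reset my password?",
    "I forgot my password, what's the process to change it?",
    "Password reset instructions, please.",
]

# One key per phrasing: every variation is a separate miss
distinct_keys = len({exact_match_key(p) for p in prompts})
print(distinct_keys)
```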

Semantic caching moves the lookup from the keyword domain to the vector domain. Instead of looking for an exact string match, we look for an embedding that is "close enough" in multi-dimensional space.

The Architecture of a Semantic Cache

A semantic cache sits between your application logic and your LLM provider. The workflow follows a specific sequence:

Input Vectorization: The user's query is converted into a vector embedding using a model like text-embedding-3-small.
Vector Search: We query a vector database (like Redis) to find the nearest neighbor to this embedding.
Distance Evaluation: We calculate the distance (often cosine distance or Euclidean distance) between the query and the best match.
Cache Hit/Miss Logic: If the distance is below a predefined threshold (e.g., 0.1), we return the cached response. If not, we proceed to the LLM.
Cache Population: On a miss, the LLM's response is stored in the cache along with the original query's embedding for future use.
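
The sequence above can be sketched without Redis at all. The following is a minimal in-memory illustration, assuming caller-supplied `embed` and `call_llm` functions (both hypothetical). A linear scan stands in for the vector index, so treat it as a way to follow the flow rather than something to deploy.

```python
import math
from typing import Callable, List, Optional, Tuple

Vector = List[float]

def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity: 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm

class InMemorySemanticCache:
    """Toy stand-in for the vector store (hypothetical class)."""

    def __init__(self, threshold: float = 0.1) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[Vector, str]] = []

    def lookup(self, query_vec: Vector) -> Optional[str]:
        if not self.entries:
            return None
        # Vector search: find the nearest stored embedding
        vec, response = min(
            self.entries, key=lambda entry: cosine_distance(query_vec, entry[0])
        )
        # Distance evaluation and hit/miss decision
        if cosine_distance(query_vec, vec) <= self.threshold:
            return response
        return None

    def store(self, query_vec: Vector, response: str) -> None:
        self.entries.append((query_vec, response))

def answer(
    prompt: str,
    embed: Callable[[str], Vector],
    call_llm: Callable[[str], str],
    cache: InMemorySemanticCache,
) -> str:
    query_vec = embed(prompt)           # input vectorization
    cached = cache.lookup(query_vec)
    if cached is not None:
        return cached                   # cache hit: the LLM is never called
    response = call_llm(prompt)         # cache miss: pay for generation
    cache.store(query_vec, response)    # cache population
    return response
```

Passing `embed` and `call_llm` as arguments keeps the flow testable with stubs before any provider or Redis instance is wired in.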

Why Redis for Semantic Caching?

While there are many vector databases available, Redis is uniquely suited for semantic caching for several reasons:

Speed: As an in-memory data store, Redis offers sub-millisecond lookups. When your goal is to reduce latency, the cache itself must be as fast as possible.
Unified Tooling: Most enterprise stacks already use Redis for session management or standard caching. Adding RediSearch (the module providing vector capabilities) avoids introducing yet another piece of infrastructure.
Flexibility: Redis allows you to store the embedding, the original prompt, the metadata, and the LLM response in a single HASH or JSON document, making retrieval straightforward.

Implementing the Solution

Let’s walk through a conceptual implementation using Python and the redis-py client. We will assume you have a Redis instance running with the RediSearch module enabled.

1. Setting up the Index

First, we need to define an index in Redis that can handle vector fields. We'll use the HNSW (Hierarchical Navigable Small World) algorithm for efficient similarity searching.

import redis
from redis.commands.search.field import VectorField, TextField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

client = redis.Redis(host='localhost', port=6379, decode_responses=True)

# Configuration for the vector index
INDEX_NAME = "semantic_cache"
VECTOR_DIM = 1536  # Output dimension of text-embedding-3-small; must match your embedding model

schema = (
    TextField("prompt"),
    TextField("response"),
    VectorField("embedding", "HNSW", {
        "TYPE": "FLOAT32",
        "DIM": VECTOR_DIM,
        "DISTANCE_METRIC": "COSINE"
    })
)

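try:
    client.ft(INDEX_NAME).create_index(
        schema,
        definition=IndexDefinition(prefix=["cache:"], index_type=IndexType.HASH),
    )
except redis.exceptions.ResponseError as exc:
    # Only swallow the "already exists" case; re-raise anything else (bad schema, missing module)
    if "Index already exists" not in str(exc):
        raise
    print("Index already exists")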

2. The Search Logic

When a query comes in, we embed it and search the index. The KNN 1 clause in the query asks Redis for only the single nearest neighbor.

import numpy as np
from redis.commands.search.query import Query

def get_cached_response(query_embedding, threshold=0.15):
    # KNN 1 returns the single nearest neighbor; its distance is exposed as "score".
    # The threshold is applied in Python below (a VECTOR_RANGE query would filter inside Redis instead).
    q = Query("*=>[KNN 1 @embedding $vec as score]")\
        .sort_by("score")\
        .return_fields("prompt", "response", "score")\
        .dialect(2)
    
    params = {"vec": np.array(query_embedding, dtype=np.float32).tobytes()}
    results = client.ft(INDEX_NAME).search(q, params)
    
    if results.docs:
        best_match = results.docs[0]
        score = float(best_match.score)
        
        # In Redis Cosine Distance, 0 is identical, 1 is orthogonal
        if score <= threshold:
            return best_match.response
            
    return None

3. Handling a Cache Miss

If get_cached_response returns None, we call our LLM and then save the result.

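import hashlib

def save_to_cache(prompt, response, embedding):
    # Python's built-in hash() is randomized per process for strings,
    # so derive the key from a stable digest instead
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    key = f"cache:{digest}"
    client.hset(key, mapping={
        "prompt": prompt,
        "response": response,
        "embedding": np.array(embedding, dtype=np.float32).tobytes()
    })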
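
Before writing an entry, it can help to check that the embedding blob has the size the index expects: a FLOAT32 vector occupies 4 bytes per dimension, so a 1536-dimension embedding is 6144 bytes. The sketch below uses only the standard library to show what those bytes look like. `encode_embedding` and `decode_embedding` are hypothetical helpers, and on little-endian machines the output should match what the NumPy `tobytes()` call produces.

```python
import struct
from typing import List, Sequence

def encode_embedding(embedding: Sequence[float], dim: int) -> bytes:
    """Pack an embedding as little-endian FLOAT32 after checking its length (hypothetical helper)."""
    if len(embedding) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(embedding)}")
    # "<" = little-endian, "f" = 4-byte float
    return struct.pack(f"<{dim}f", *embedding)

def decode_embedding(blob: bytes) -> List[float]:
    """Unpack a FLOAT32 blob back into a list of Python floats."""
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}f", blob))
```

Failing fast on a length mismatch is a design choice: it surfaces a wrong embedding model or a truncated vector at write time rather than as a confusing search result later.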

The Threshold Problem: Precision vs. Recall

The most critical part of semantic caching is the similarity threshold.

Too strict (e.g., 0.05): You will experience many cache misses for queries that were essentially the same, reducing the ROI of the cache.
Too loose (e.g., 0.30): You risk "semantic drift," where the cache returns an answer to a different question. For example, a user asking "How do I delete my account?" might get the cached answer for "How do I update my account?"

In my experience, the optimal threshold is highly dependent on the embedding model used. For OpenAI's text-embedding-3-small, a cosine distance between 0.1 and 0.15 is usually the sweet spot for general Q&A. I recommend logging the distance of every cache hit during a pilot phase to determine where the quality starts to degrade.
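
One way to use those pilot logs is to label a sample of logged matches as correct or incorrect and sweep candidate thresholds over them. The sketch below assumes each logged sample is a `(distance, was_correct)` pair produced by manual review; the function name and data shape are illustrative assumptions.

```python
from typing import List, Optional, Sequence, Tuple

Sample = Tuple[float, bool]  # (cosine distance of best match, was the cached answer correct?)

def sweep_thresholds(
    samples: Sequence[Sample], thresholds: Sequence[float]
) -> List[Tuple[float, int, Optional[float]]]:
    """For each threshold, return (threshold, hit count, precision among hits)."""
    rows = []
    for threshold in thresholds:
        # A sample counts as a hit when its distance falls at or below the threshold
        outcomes = [correct for distance, correct in samples if distance <= threshold]
        precision = sum(outcomes) / len(outcomes) if outcomes else None
        rows.append((threshold, len(outcomes), precision))
    return rows
```

Reading the rows from strict to loose shows where extra hits start costing precision, which is the point at which to stop loosening.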

Advanced Strategies for Production

TTL and Cache Invalidation

In a RAG system, your underlying data changes. If your documentation is updated, your cached answers might become obsolete.

Standard Redis TTL (Time To Live) works here. You can set an expiration on the cache keys (e.g., 24 hours). However, a more sophisticated approach is to clear the cache whenever the RAG knowledge base is updated. Since semantic caching is often tied to specific versions of your data, consider including a data_version tag in your metadata and filtering your vector search by that version.
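
Both ideas can be sketched in a few lines. This example assumes the index schema also declares a `TagField("data_version")` (not part of the schema shown earlier) and that version labels are plain alphanumeric strings such as `v42`, since other characters need escaping in tag filters. The helper names are hypothetical, and `client` is any redis-py style client.

```python
from typing import Any, Mapping

CACHE_TTL_SECONDS = 24 * 60 * 60  # the 24-hour example from the text

def save_with_ttl(
    client: Any,
    key: str,
    fields: Mapping[str, Any],
    data_version: str,
    ttl: int = CACHE_TTL_SECONDS,
) -> None:
    """Write a cache entry tagged with its data version and let Redis expire it."""
    client.hset(key, mapping={**fields, "data_version": data_version})
    client.expire(key, ttl)

def versioned_knn_query(data_version: str) -> str:
    """Query string that restricts the KNN search to entries from one data version."""
    if not data_version.isalnum():
        raise ValueError("this sketch only supports alphanumeric version labels")
    return f"(@data_version:{{{data_version}}})=>[KNN 1 @embedding $vec AS score]"
```

The returned string can be passed to `Query(...)` in place of the unfiltered `*=>[KNN ...]` expression, so entries from older versions are simply never matched and age out through their TTL.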

Evaluation and Feedback Loops

To catch cached answers that are wrong or outdated, implement a simple "thumbs up/down" on the UI. If a user gives a thumbs down to a cached response, you can programmatically delete that entry from Redis so it isn't served to the next user.
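
A minimal sketch of that eviction hook follows. It assumes the lookup function also returns the Redis key of the matched entry (in redis-py search results this is available as the document's `id`), so the UI can send it back with the feedback. `record_feedback` is a hypothetical name.

```python
from typing import Any

def record_feedback(client: Any, cache_key: str, thumbs_up: bool) -> bool:
    """Evict a cached entry on negative feedback. Returns True if an entry was removed."""
    if thumbs_up:
        return False
    # DEL returns the number of keys removed, so 0 means the entry was already gone
    return client.delete(cache_key) > 0
```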

Security and Privacy

Caching introduces a potential security risk: PII (Personally Identifiable Information) leakage. If User A asks about their specific billing issue and the response contains their name or account number, you do not want that response cached and served to User B.

Strategy: Never cache responses that contain sensitive user data. You can use a PII detection layer or, more simply, only enable semantic caching for "Global" or "General" knowledge queries, while bypassing the cache for user-specific data lookups.
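
As a rough illustration of such a gate, the sketch below refuses to cache anything flagged as user-specific and applies two simple pattern checks to the rest. The patterns are deliberately naive placeholders, not a real PII detector, and `is_cacheable` and its `user_specific` flag are assumptions about how your application classifies queries.

```python
import re

# Illustrative patterns only; a production system would use a dedicated PII detection layer
_PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),  # email addresses
    re.compile(r"\b\d{8,}\b"),                # long digit runs such as account numbers
]

def is_cacheable(prompt: str, response: str, user_specific: bool) -> bool:
    """Return True only for general-knowledge exchanges with no obvious PII."""
    if user_specific:
        return False
    text = prompt + "\n" + response
    return not any(pattern.search(text) for pattern in _PII_PATTERNS)
```

Calling this check before `save_to_cache` keeps the decision at write time, so a risky response never enters the cache in the first place.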

Measuring Success

When you implement semantic caching, you should track three primary KPIs:

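Cache Hit Rate: The percentage of queries served by Redis. As a rough guide, a RAG system with repetitive traffic often sees 20-40%, but the figure depends heavily on your query mix and threshold.
Latency Reduction: Compare the P99 latency of LLM calls (often >2000ms) vs. cache hits (often <50ms for the Redis lookup itself). Measure hits end to end, since each one still includes the time to embed the query.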
Cost Savings: Calculate the tokens saved per day. In high-volume systems, this can easily amount to thousands of dollars per month.
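
These three numbers can be derived from per-request logs. The sketch below assumes each request is logged with a hit flag, an end-to-end latency, and an estimate of the tokens the LLM call would have used. The `RequestLog` shape and the price parameter are hypothetical, and the percentile uses the simple nearest-rank method.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence

@dataclass
class RequestLog:
    cache_hit: bool
    latency_ms: float
    tokens_saved: int  # estimated tokens avoided on a hit; 0 on a miss

def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(logs: List[RequestLog], usd_per_1k_tokens: float) -> Dict[str, Optional[float]]:
    """Compute hit rate, P99 latency for hits and misses, and estimated savings."""
    hits = [log.latency_ms for log in logs if log.cache_hit]
    misses = [log.latency_ms for log in logs if not log.cache_hit]
    tokens = sum(log.tokens_saved for log in logs)
    return {
        "hit_rate": len(hits) / len(logs) if logs else None,
        "p99_hit_ms": percentile(hits, 99) if hits else None,
        "p99_miss_ms": percentile(misses, 99) if misses else None,
        "usd_saved": tokens / 1000 * usd_per_1k_tokens,
    }
```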

Conclusion

Semantic caching is no longer an optional optimization; it is a necessity for production-grade AI applications. By leveraging Redis and vector similarity search, we bridge the gap between the rigid nature of traditional caching and the fluid nature of human language.

To get started:

Audit your logs: Identify the most common redundant queries in your system.
Prototype with Redis: Use the RediSearch module to build a simple vector index.
Benchmark your threshold: Start with a strict threshold and loosen it as you gain confidence in the similarity matches.

By moving the heavy lifting from the LLM to an in-memory vector store, you provide a snappier experience for your users and a more sustainable bottom line for your business.