Optimizing LLM Cost and Latency with Redis Semantic Caching

As LLMs move from experimental prototypes to production-ready services, engineering teams are hitting two major roadblocks: the 'LLM tax' (cost) and the 'latency wall.' Every call to a flagship model like GPT-4 or Claude 3 Opus incurs a financial cost and a significant delay, often measured in seconds.

While traditional caching is a staple of web architecture, it has historically failed in the context of Natural Language Processing (NLP). If one user asks "How do I reset my password?" and another asks "I forgot my password, how do I change it?", a traditional key-value cache treats these as entirely different requests. Semantic caching changes this by focusing on intent rather than syntax.

In this article, we will explore how to build a production-grade semantic cache using Redis and vector embeddings to slash costs and bring P99 latencies down to milliseconds.

The Limitations of Exact-Match Caching

Standard caching mechanisms (like a basic Redis GET/SET) rely on string equality. They hash the input string and use it as a key. This is highly efficient for REST APIs or database queries where the input is deterministic.

However, natural language is non-deterministic. There are infinite ways to phrase the same question. If your LLM-powered chatbot's cache relies on exact string matching, its hit rate will likely hover near zero. You end up paying for the same computation over and over again, simply because of a comma or a synonym.

Understanding Semantic Caching

Semantic caching uses vector embeddings to represent the 'meaning' of a query as a series of coordinates in high-dimensional space. Instead of checking if two strings are identical, we check if two vectors are geometrically close to one another.
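
To make "close" concrete: throughout this article, closeness is measured with cosine distance (1 minus cosine similarity), where 0 means the vectors point in exactly the same direction. A minimal NumPy sketch, purely for illustration:

import numpy as np

def cosine_distance(a, b):
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    # Cosine distance = 1 - cosine similarity
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))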

The Workflow

  1. Query Arrival: A user submits a prompt.
  2. Embedding Generation: The prompt is sent to an embedding model (e.g., OpenAI’s text-embedding-3-small or a local HuggingFace model).
  3. Vector Search: The resulting vector is used to query a vector database (Redis) to find the nearest neighbor.
  4. Threshold Evaluation: We calculate the distance (e.g., cosine distance or Euclidean distance) between the new query and the cached query.
  5. Cache Hit: If the distance is below a specific threshold (e.g., 0.1), we return the cached response.
  6. Cache Miss: If no close match is found, we call the LLM, store the result and the embedding in Redis, and return the response to the user (see the end-to-end sketch after this list).
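
Taken together, the flow fits in a few lines of Python. This is a minimal sketch: check_cache and store_in_cache are implemented later in this article, and call_llm stands in for whatever LLM client you use.

def answer(prompt: str) -> str:
    # Steps 1-5: embed the prompt and look for a semantically close cached entry
    cached = check_cache(prompt)
    if cached is not None:
        return cached  # cache hit: no LLM call needed

    # Step 6: cache miss. Call the LLM, then persist prompt, response, and embedding
    response = call_llm(prompt)  # placeholder for your LLM client
    store_in_cache(prompt, response)
    return response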

Why Redis for Semantic Caching?

Most developers already use Redis as a traditional cache or message broker. With the introduction of Redis Search and Query features (formerly RediSearch), Redis has evolved into a highly performant vector database.

Using Redis for semantic caching offers several advantages:

  • Performance: Being an in-memory store, Redis provides sub-millisecond search latencies, which is critical when the goal is to avoid a 2-second LLM call.
  • Simplicity: You don't need to introduce a new specialized vector database into your stack if you already have Redis.
  • Hybrid Queries: You can combine vector search with traditional metadata filtering (e.g., "Find a similar question but only within the 'billing' category"), as illustrated right after this list.
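
Anticipating the implementation below, a hybrid query of that kind is just a filter expression prepended to the KNN clause. The example assumes a category field indexed as a TAG (not part of the schema shown later in this article), and the $vec parameter is supplied exactly as in the lookup code below:

from redis.commands.search.query import Query

# Pre-filter to the "billing" category, then run KNN 1 within that subset
hybrid_query = (
    Query("(@category:{billing})=>[KNN 1 @embedding $vec AS score]")
    .sort_by("score")
    .return_fields("prompt", "response", "score")
    .dialect(2)
)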

Implementing the Solution

To implement this, we need a Redis instance with the search module enabled and a Python environment with the redis-py and openai libraries.

1. Initializing the Redis Schema

First, we define our index. We need to store the original prompt, the LLM response, and the vector representation of the prompt.

import redis
from redis.commands.search.field import VectorField, TextField
from redis.commands.search.index_definition import IndexDefinition, IndexType

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

# Define the index schema: the original prompt, the LLM response,
# and a 1536-dimensional FLOAT32 vector (matching text-embedding-3-small)
schema = (
    TextField("prompt"),
    TextField("response"),
    VectorField("embedding", "HNSW", {
        "TYPE": "FLOAT32",
        "DIM": 1536,
        "DISTANCE_METRIC": "COSINE"
    })
)

# Create the index over all hashes whose keys start with "cache:"
try:
    r.ft("idx:cache").create_index(
        schema,
        definition=IndexDefinition(prefix=["cache:"], index_type=IndexType.HASH)
    )
except redis.ResponseError:
    print("Index already exists")

In this example, we use the HNSW (Hierarchical Navigable Small World) algorithm. It is generally preferred over FLAT indexing for production use cases because it provides faster search at scale, at the cost of slightly higher memory usage and a small loss of recall.
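
For smaller caches, exact (FLAT) search is often fast enough and is perfectly accurate; the trade-off only matters once the cache grows large. A hedged sketch of the alternative field definition:

# FLAT performs brute-force, exact nearest-neighbor search;
# simpler and fully accurate, but query time grows linearly with cache size
flat_embedding_field = VectorField("embedding", "FLAT", {
    "TYPE": "FLOAT32",
    "DIM": 1536,
    "DISTANCE_METRIC": "COSINE"
})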

2. The Semantic Lookup Logic

When a query comes in, we must convert it to an embedding and search the index. The threshold for what constitutes a "hit" is the most critical tuning parameter in your system.

import numpy as np
from openai import OpenAI
from redis.commands.search.query import Query

client = OpenAI()

def get_embedding(text):
    return client.embeddings.create(
        input=[text],
        model="text-embedding-3-small"
    ).data[0].embedding

def check_cache(query_text, threshold=0.15):
    query_vector = get_embedding(query_text)

    # Prepare the Redis vector query:
    # look for the 1 nearest neighbor and return its distance as "score"
    query = (
        Query("*=>[KNN 1 @embedding $vec AS score]")
        .sort_by("score")
        .return_fields("prompt", "response", "score")
        .dialect(2)
    )
    params = {"vec": np.array(query_vector, dtype=np.float32).tobytes()}

    results = r.ft("idx:cache").search(query, params)

    if results.docs:
        score = float(results.docs[0].score)
        if score <= threshold:
            return results.docs[0].response  # cache hit
    return None  # cache miss
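
The snippet above covers only the read path. For completeness, here is a hedged sketch of the write path on a cache miss; the key format, TTL value, and function name are illustrative rather than taken from the article:

import uuid

def store_in_cache(prompt, response, ttl_seconds=86400):
    # Serialize the embedding the same way the index expects it (FLOAT32 bytes)
    embedding = np.array(get_embedding(prompt), dtype=np.float32).tobytes()

    key = f"cache:{uuid.uuid4().hex}"  # must match the index prefix "cache:"
    r.hset(key, mapping={
        "prompt": prompt,
        "response": response,
        "embedding": embedding,
    })
    # Optional TTL so stale answers eventually age out (see the section on invalidation below)
    r.expire(key, ttl_seconds)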

The "Goldilocks" Threshold Problem

Setting the similarity threshold is an exercise in trade-offs:

  • Too Strict (Low Threshold): You will have many cache misses. You ensure high accuracy, but you lose the cost and latency benefits.
  • Too Loose (High Threshold): You will have high cache hits, but you risk returning irrelevant or incorrect answers. If a user asks "How do I delete my account?" and the cache returns the answer for "How do I create an account?", the user experience is ruined.

I recommend starting with a threshold based on Cosine Distance. In many OpenAI embedding use cases, a distance of 0.1 to 0.2 is a safe starting point. However, you should log your "near misses" and manually review them to calibrate this number for your specific domain.
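
One way to do that logging, sketched here as a helper that takes over the hit/miss decision from check_cache (the margin value is illustrative, not from the article):

import logging

logger = logging.getLogger("semantic_cache")
NEAR_MISS_MARGIN = 0.10  # illustrative; tune per domain

def evaluate_hit(query_text, doc, threshold=0.15):
    # `doc` is the nearest-neighbor document returned by the KNN search above
    if doc is None:
        return None
    distance = float(doc.score)
    if distance <= threshold:
        return doc.response
    if distance <= threshold + NEAR_MISS_MARGIN:
        # Close enough to be interesting, not close enough to serve:
        # review these periodically to calibrate the threshold
        logger.info("near miss: query=%r cached=%r distance=%.3f",
                    query_text, doc.prompt, distance)
    return None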

Advanced Strategies for Production

TTL and Cache Invalidation

Unlike traditional web data, LLM responses might become "stale" if the underlying product changes. You should implement a Time-To-Live (TTL) on your cache entries. In Redis, you can set an expiration on the hash keys.
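
A simple pattern, assuming every entry is written with a TTL as in the write-path sketch above, is a sliding expiration: refresh the TTL each time an entry is served, so popular answers stay warm while unused ones age out.

CACHE_TTL_SECONDS = 86400  # illustrative value (24 hours)

def refresh_ttl_on_hit(doc_id):
    # doc_id is the full Redis key of the matched entry (e.g., "cache:<uuid>")
    r.expire(doc_id, CACHE_TTL_SECONDS)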

One caveat when combining TTLs with a vector index is expired keys. Fortunately, Redis handles this automatically: when a key expires, it is also removed from the index.

Handling PII and Security

Caching prompts poses a security risk if multiple users share the same cache. If User A asks "What is my current balance?" and the LLM responds with a specific dollar amount, you do not want that cached response served to User B.

The Rule: Only cache generic, non-personalized queries. You can achieve this by adding a user_id or is_private flag to your Redis schema and including it in your search filter:

# Example of filtering by user_id to ensure privacy
query = Query("(@user_id:{123})=>[KNN 1 @embedding $vec AS score]")
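
Note that the curly-brace {123} syntax requires user_id to be indexed as a TAG field, which the earlier schema does not include. A hedged sketch of the extended schema:

from redis.commands.search.field import TagField

# Same fields as before, plus a user_id tag for per-user (or per-tenant) isolation
schema = (
    TextField("prompt"),
    TextField("response"),
    TagField("user_id"),
    VectorField("embedding", "HNSW", {
        "TYPE": "FLOAT32",
        "DIM": 1536,
        "DISTANCE_METRIC": "COSINE"
    })
)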

Evaluation and Monitoring

You cannot "set and forget" a semantic cache. You need to monitor your Cache Hit Ratio and Semantic Drift. Use a tool like LangSmith or custom Prometheus metrics to track how often the cache is used and whether users are reporting the cached responses as unhelpful.
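
If you go the Prometheus route, two counters are enough to derive the hit ratio; the metric names below are illustrative, not a standard:

from prometheus_client import Counter

CACHE_HITS = Counter("semantic_cache_hits_total", "Queries served from the semantic cache")
CACHE_MISSES = Counter("semantic_cache_misses_total", "Queries that fell through to the LLM")

# In the request path: CACHE_HITS.inc() on a hit, CACHE_MISSES.inc() on a miss.
# Hit ratio = hits / (hits + misses), computed in your Prometheus/Grafana dashboards.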

Real-World Impact: A Case Study

Consider a technical support bot for a SaaS platform.

  • Average LLM Latency: 2.4 seconds
  • Average LLM Cost: $0.01 per interaction
  • Cache Hit Rate: 35% (typical for support bots)
  • Redis Latency: ~5ms for the vector search (plus ~80ms for embedding generation on the cached path)

For 1,000,000 queries:

  • Without Cache: $10,000 cost, 666 hours of total user wait time.
  • With Semantic Cache: $6,500 cost, 433 hours of total user wait time.

This represents a 35% reduction in COGS and a massive improvement in the perceived snappiness of the application for over a third of your users.
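
The figures above follow from straightforward arithmetic; a quick back-of-envelope check using the numbers from the bullets (the cached-path latency of roughly 85ms is ignored, as it is negligible next to 2.4 seconds):

queries = 1_000_000
cost_per_call = 0.01    # USD
llm_latency_s = 2.4
hit_rate = 0.35

misses = queries * (1 - hit_rate)              # 650,000 calls still reach the LLM
cost_without_cache = queries * cost_per_call   # $10,000
cost_with_cache = misses * cost_per_call       # $6,500

wait_without_cache_h = queries * llm_latency_s / 3600  # ~667 hours
wait_with_cache_h = misses * llm_latency_s / 3600      # ~433 hours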

Conclusion: Your Action Plan

Implementing semantic caching is one of the highest-ROI tasks a platform engineer can undertake when scaling GenAI features. It directly impacts the bottom line and the user experience simultaneously.

To get started:

  1. Audit your current LLM logs: Identify how many queries are semantically similar.
  2. Prototype with RedisVL: Use the Redis Vector Library (Python) to simplify the boilerplate of index creation and searching (see the sketch after this list).
  3. Start Strict: Set a very low distance threshold (e.g., 0.05) and gradually loosen it as you gain confidence in the similarity matches.
  4. Implement Metadata Filtering: Ensure you aren't serving cached answers across security boundaries (e.g., different organizations or user roles).
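
If you try the RedisVL route, the library ships a SemanticCache helper that wraps the schema, embedding, and threshold logic. The sketch below is based on my reading of the RedisVL documentation; verify the import path and argument names against the version you install:

from redisvl.extensions.llmcache import SemanticCache

llmcache = SemanticCache(
    name="llmcache",
    redis_url="redis://localhost:6379",
    distance_threshold=0.1,  # start strict, then loosen (step 3 above)
)

# Store a prompt/response pair on a miss...
llmcache.store(prompt="How do I reset my password?",
               response="Go to Settings > Security and click 'Reset password'.")

# ...and check for semantically similar prompts on later queries
hits = llmcache.check(prompt="I forgot my password, how do I change it?")
if hits:
    print(hits[0]["response"])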

By treating your LLM prompts as searchable vectors rather than static strings, you move from a naive integration to a sophisticated, production-grade AI architecture.