Testing RAG: Building LLM Eval Pipelines with DeepEval and Pytest

The most significant challenge in moving Large Language Model (LLM) applications from prototype to production isn't the model itself; it's the uncertainty of the output. In traditional software engineering, we rely on deterministic unit tests. If add(2, 2) doesn't return 4, the build fails.

In the world of Retrieval-Augmented Generation (RAG), things are murkier. You might ask your documentation bot a question, and it gives a perfect answer today. Tomorrow, due to a slight change in the retrieval context or a model update, it might hallucinate a feature that doesn't exist. Relying on "vibe checks"—manual spot-checking of responses—is the fastest way to ship a broken product. To build production-grade AI, we need to treat LLM outputs as measurable data points.

This guide explores how to implement a robust, automated evaluation pipeline using DeepEval and Pytest to quantify faithfulness and relevancy within your CI/CD workflow.

The "Vibe Check" Bottleneck

Most teams start their RAG journey by manually testing a few prompts in a playground. This works for a PoC, but it fails at scale for three reasons:

Regressions are invisible: Optimizing your chunking strategy might improve one answer but break ten others.
Subjectivity: What looks like a "good" answer to one developer might be missing critical technical nuance for another.
Non-determinism: LLMs are probabilistic. A single manual check doesn't account for the variance in model responses.
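To make the non-determinism point concrete, here is a minimal sketch that sends the same prompt several times and reports the fraction of runs that pass a check. The names flaky_bot and mentions_jwt are hypothetical stand-ins for a real model call and a real grader, and the seeded random generator only simulates variance; treat it as an illustration rather than a measurement of any real model.

```python
import random
from typing import Callable


def pass_rate(generate: Callable[[str], str],
              check: Callable[[str], bool],
              prompt: str,
              runs: int = 20) -> float:
    """Call `generate` `runs` times and return the fraction of outputs passing `check`."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passed = sum(1 for _ in range(runs) if check(generate(prompt)))
    return passed / runs


# Hypothetical stand-in for an LLM call: answers correctly most of the time.
_rng = random.Random(0)  # seeded so the simulation is repeatable


def flaky_bot(prompt: str) -> str:
    if _rng.random() < 0.8:
        return "Authenticate with a JWT in the Authorization header."
    return "Authenticate with OAuth2."  # simulated hallucination


def mentions_jwt(answer: str) -> bool:
    return "JWT" in answer


rate = pass_rate(flaky_bot, mentions_jwt, "How do I authenticate?", runs=50)
print(f"pass rate over 50 runs: {rate:.2f}")
```

A single manual check samples this distribution once; a pass rate over many runs is what tells you how often users would actually see the bad answer.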

To solve this, we need an automated way to grade LLM responses against well-defined metrics, in particular the "RAG Triad": Faithfulness, Answer Relevancy, and Contextual Relevancy.

Why DeepEval and Pytest?

DeepEval is an open-source testing framework for LLMs that operates on the principle of "LLM-as-a-judge." It uses highly capable models (like GPT-4o) to evaluate the outputs of your smaller or specialized production models.

By pairing DeepEval with Pytest, we gain several advantages:

Developer Familiarity: Most Python developers already know how to use Pytest.
CI/CD Integration: Pytest exits with a non-zero status code when any test fails, which tools like GitHub Actions and GitLab CI treat as a failed job.
Modular Testing: You can run your LLM evals alongside your standard unit tests.

Defining the Core Metrics

Before writing code, we must define what we are measuring. In a RAG pipeline, two metrics are paramount:

1. Faithfulness (Anti-Hallucination)

Faithfulness measures whether the LLM's response is derived solely from the retrieved context. If the LLM claims your software supports OAuth2, but the retrieved documentation snippet never mentions OAuth2, the faithfulness score drops. This is the primary tool for catching hallucinations.
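As a rough intuition, a faithfulness score can be read as the share of claims in the answer that the retrieved context supports. The sketch below is a toy version of that ratio, not DeepEval's implementation: the claims are hand-written rather than extracted by a judge model, and substring matching stands in for the judge's verdict. The function name toy_faithfulness is hypothetical.

```python
from typing import List


def toy_faithfulness(claims: List[str], context: List[str]) -> float:
    """Fraction of claims whose text appears (case-insensitively) in the context.

    Substring matching is a crude stand-in for an LLM judge's verdict.
    """
    if not claims:
        return 1.0  # nothing asserted, so nothing unsupported
    haystack = " ".join(context).lower()
    supported = sum(1 for claim in claims if claim.lower() in haystack)
    return supported / len(claims)


context = ["Our API supports JWT authentication via the Authorization header."]

# Claims are hand-written here; a real judge would extract them from the answer.
grounded = ["JWT authentication", "Authorization header"]
hallucinated = ["JWT authentication", "OAuth2"]

print(toy_faithfulness(grounded, context))      # 1.0
print(toy_faithfulness(hallucinated, context))  # 0.5
```

With a threshold of 0.7, the second answer would fail: one of its two claims has no support in the context.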

2. Answer Relevancy

Relevancy measures how well the answer addresses the user's prompt. An answer can be 100% faithful to the context but completely irrelevant to what the user actually asked.
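The same ratio idea applies here, with a different question asked of each statement: does it address the prompt? The following toy sketch, under the assumption that keyword overlap is a good enough proxy, scores an answer by the share of its statements that share a keyword with the question. A real judge model handles paraphrase and word forms that this version misses (for example, "authenticate" versus "authentication"); toy_relevancy and the stopword list are hypothetical.

```python
import re
from typing import List, Set

STOPWORDS = {"how", "do", "i", "the", "a", "an", "is", "to", "in",
             "you", "can", "our", "of", "with", "it", "as", "are"}


def keywords(text: str) -> Set[str]:
    """Lowercased word tokens minus a tiny stopword list."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def toy_relevancy(question: str, statements: List[str]) -> float:
    """Fraction of answer statements sharing at least one keyword with the question."""
    if not statements:
        return 0.0
    q = keywords(question)
    relevant = sum(1 for s in statements if keywords(s) & q)
    return relevant / len(statements)


question = "How do I authenticate with the API?"
on_topic = ["Authenticate with a JWT.",
            "Send it in the Authorization header of each API request."]
# These could be perfectly faithful to some retrieved context, yet miss the question.
off_topic = ["Rate limits reset every hour.",
             "Responses are returned as JSON."]

print(toy_relevancy(question, on_topic))   # 1.0
print(toy_relevancy(question, off_topic))  # 0.0
```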

Practical Implementation: Setting up the Pipeline

Let's build a test suite for a hypothetical technical support bot. First, install the necessary dependencies:

pip install deepeval pytest

Step 1: The RAG Component

Assume we have a simple RAG function that takes a query and returns a generated answer and the retrieved context.

# rag_app.py
def query_docs(prompt):
    # Imagine this calls your vector DB and LLM
    context = ["Our API supports JWT authentication via the Authorization header."]
    response = "You can authenticate using JWT in the header."
    return response, context
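If you want something slightly less hard-coded to experiment with, the stub can be swapped for a small in-memory version. This is only a sketch under stated assumptions: DOCS is a made-up knowledge base, retrieve ranks by shared word tokens instead of querying a vector DB, and the "LLM" simply echoes the top document. It keeps the same return shape as the stub above.

```python
import re
from typing import List, Set, Tuple

# Hypothetical in-memory "knowledge base"; a real system would query a vector DB.
DOCS = [
    "Our API supports JWT authentication via the Authorization header.",
    "Rate limits are applied per API key.",
    "Webhooks deliver events as JSON payloads.",
]


def _tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(prompt: str, top_k: int = 1) -> List[str]:
    """Rank documents by the number of word tokens shared with the prompt."""
    query = _tokens(prompt)
    ranked = sorted(DOCS, key=lambda d: len(_tokens(d) & query), reverse=True)
    return ranked[:top_k]


def query_docs(prompt: str) -> Tuple[str, List[str]]:
    """Return (response, retrieval context), like the stub it replaces."""
    context = retrieve(prompt)
    # Placeholder for the LLM call: echo the top document as the answer.
    response = f"According to the docs: {context[0]}"
    return response, context
```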

Step 2: Writing the DeepEval Test

Create a file named test_rag.py. We will use DeepEval’s assert_test to validate our metrics.

from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase
from rag_app import query_docs

def test_rag_response():
    # 1. Set up the inputs
    user_input = "How do I authenticate?"
    actual_output, retrieval_context = query_docs(user_input)

    # 2. Define the metrics (a metric passes when its score is >= threshold;
    #    0.7 leaves room for minor judge noise)
    faithfulness = FaithfulnessMetric(threshold=0.7)
    relevancy = AnswerRelevancyMetric(threshold=0.7)

    # 3. Create a test case
    test_case = LLMTestCase(
        input=user_input,
        actual_output=actual_output,
        retrieval_context=retrieval_context
    )

    # 4. Execute the assertion; it fails if either metric is below its threshold
    assert_test(test_case, [faithfulness, relevancy])

Step 3: Running the Tests

Set the API key for the judge model (DeepEval uses OpenAI by default), then run the test using the standard Pytest command:

export OPENAI_API_KEY="your-key"
pytest test_rag.py

DeepEval will output a detailed breakdown of why a test passed or failed. If the LLM hallucinated, DeepEval provides the "Reasoning," explaining exactly which part of the output wasn't supported by the context.

Automating Evaluations in CI/CD

The real power of this setup is realized when it's integrated into your deployment pipeline. You can prevent a "hallucination-prone" model version from ever reaching production.

Here is a sample GitHub Actions workflow (.github/workflows/evals.yml):

name: LLM Evaluation

on:
  pull_request:
    branches: [ main ]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install deepeval pytest

      - name: Run DeepEval Tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: pytest test_rag.py

In this configuration, if your RAG system's faithfulness score drops below your defined threshold, Pytest exits with a failure and the CI job fails. If that job is a required status check in your branch protection rules, the PR cannot be merged until it passes. This shifts the quality assurance of AI from post-deployment monitoring to pre-deployment prevention.
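The mechanism CI relies on is just the process exit code. As a hedged illustration, this sketch maps a set of scores to an exit code the way a custom evaluation script might; gate and the example scores are hypothetical, and with Pytest you get this behavior without writing it yourself.

```python
import sys
from typing import Dict


def gate(scores: Dict[str, float], threshold: float) -> int:
    """Return 0 if every score meets the threshold, else 1 (a failing exit code)."""
    failing = {name: s for name, s in scores.items() if s < threshold}
    for name, s in sorted(failing.items()):
        print(f"FAIL {name}: {s:.2f} < {threshold:.2f}", file=sys.stderr)
    return 1 if failing else 0


# Hypothetical scores; in practice these would come from an evaluation run.
example_scores = {"auth-question": 0.92, "rate-limit-question": 0.55}
exit_code = gate(example_scores, threshold=0.7)
print(f"exit code: {exit_code}")
# A CI entry point would finish with: sys.exit(exit_code)
```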

Advanced Concept: Synthetic Data Generation

One common hurdle is the lack of a "Golden Dataset"—a set of high-quality queries and ground-truth answers. DeepEval can help here by generating synthetic test cases from your existing documents.

By pointing DeepEval at your PDF or Markdown files, it can use an LLM to generate dozens of likely user questions and their corresponding ideal contexts. This allows you to scale your test suite from 5 hand-written tests to 500 automated tests in minutes.

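from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
# Keep the returned goldens so they can be reviewed and saved.
# The argument that caps how many goldens are generated has been renamed
# across DeepEval releases (max_goldens_per_context in recent ones);
# check the signature for your installed version.
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=['docs/api-guide.md'],
    max_goldens_per_context=2
)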
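Whether your goldens are generated or hand-written, it helps to store them in a stable format and know exactly which version a test run used. The sketch below assumes a simple record shape of my own choosing (GoldenRecord is hypothetical and is not DeepEval's class) and derives a short fingerprint from the serialized data using only the standard library.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class GoldenRecord:
    """Hypothetical on-disk shape for one test case; not DeepEval's own class."""
    input: str
    expected_output: str
    context: List[str]


def serialize(records: List[GoldenRecord]) -> str:
    """Stable JSON: sorted keys so the same data always yields the same text."""
    return json.dumps([asdict(r) for r in records], sort_keys=True, indent=2)


def fingerprint(records: List[GoldenRecord]) -> str:
    """Short SHA-256 digest that changes whenever the dataset content changes."""
    return hashlib.sha256(serialize(records).encode("utf-8")).hexdigest()[:12]


dataset = [
    GoldenRecord(
        input="How do I authenticate?",
        expected_output="Use a JWT in the Authorization header.",
        context=["Our API supports JWT authentication via the Authorization header."],
    )
]
print(fingerprint(dataset))
```

Logging the fingerprint next to each evaluation run makes it possible to tell a model regression apart from a change in the test data.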

The Cost and Latency Trade-off

As a senior engineer, you must consider the trade-offs. Running GPT-4o as a judge for every single commit can be expensive and slow. To optimize your pipeline:

Sample your tests: Don't run the full 500-test suite on every commit. Run a "smoke test" of 10 core queries on every PR, and run the full suite nightly.
Use smaller models for judging: For simpler metrics, a fine-tuned GPT-3.5 or an open-source model like Llama 3 can often suffice as a judge, significantly reducing costs.
Parallelization: Use pytest-xdist (for example, pytest -n 4) to run your evaluations in parallel. Since LLM evals are I/O bound (waiting for API responses), running them concurrently can cut wall-clock time substantially, as long as you stay within your provider's rate limits.
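The sampling and parallelization ideas above can be sketched with the standard library alone. This is an analogy rather than how pytest-xdist works (xdist distributes tests across worker processes, while this uses threads): smoke_subset picks a repeatable sample, evaluate_parallel overlaps the waiting time of I/O-bound calls, and fake_judge is a hypothetical stand-in that sleeps instead of calling an API.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def smoke_subset(queries: List[str], k: int, seed: int = 0) -> List[str]:
    """Pick a repeatable sample of k queries for per-PR runs."""
    if k >= len(queries):
        return list(queries)
    return random.Random(seed).sample(queries, k)


def evaluate_parallel(queries: List[str],
                      evaluate: Callable[[str], float],
                      max_workers: int = 8) -> List[float]:
    """Score queries concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate, queries))


def fake_judge(query: str) -> float:
    """Hypothetical stand-in for a judge API call; sleeps to mimic network latency."""
    time.sleep(0.05)
    return 1.0 if "auth" in query else 0.5


all_queries = [f"auth question {i}" for i in range(40)]
subset = smoke_subset(all_queries, k=10)
scores = evaluate_parallel(subset, fake_judge)
print(f"evaluated {len(scores)} of {len(all_queries)} queries")
```

A fixed seed keeps the smoke set stable between PRs, so a failure points at a change in the system and not at a different sample. For a hand-picked list of core queries, skip the sampling and pass the list directly.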

Best Practices for RAG Evals

Version your data: Your evaluation is only as good as your test data. Version your "Golden Dataset" alongside your code.
Analyze the "Why": Don't just look at the pass/fail. Each DeepEval metric exposes a reason attribute once it has been measured. Log these reasons to a tool like LangSmith or Weights & Biases to identify patterns in model failure.
Iterate on Thresholds: Start with lenient thresholds (e.g., 0.6) and gradually raise them as you refine your prompt engineering and retrieval logic.
Separate Retrieval from Generation: Sometimes a test fails because the retriever failed to find the right document, not because the LLM hallucinated. Use ContextualPrecisionMetric, which also needs an expected_output on the test case, to isolate retrieval issues.
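To see what a retrieval-only score looks like, here is a toy rank-weighted precision: it rewards retrievers that place relevant chunks near the top. As far as I understand, DeepEval's contextual precision follows a similar rank-weighted idea, but the relevance flags there come from a judge model comparing each chunk to the expected output. In this sketch they are supplied by hand, and rank_weighted_precision is a hypothetical name.

```python
from typing import List


def rank_weighted_precision(relevance: List[bool]) -> float:
    """Average of precision@k taken at each rank k that holds a relevant chunk.

    `relevance[i]` says whether the chunk retrieved at rank i+1 was relevant.
    """
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


print(rank_weighted_precision([True, False]))   # 1.0: relevant chunk ranked first
print(rank_weighted_precision([False, True]))   # 0.5: relevant chunk ranked second
print(rank_weighted_precision([False, False]))  # 0.0: retriever found nothing useful
```

A low score here alongside a high faithfulness score suggests the generator is behaving and the retriever is what needs attention.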

Actionable Conclusion

Moving from "vibe checks" to automated pipelines is one of the most important steps in professionalizing LLM development. By combining DeepEval and Pytest, you transform a subjective process into a quantifiable engineering discipline.

To get started today:

Identify the top 10 most critical queries for your RAG system.
Write a Pytest script using DeepEval's FaithfulnessMetric for these queries.
Integrate that script into your CI pipeline.
Set a threshold that reflects your risk tolerance and iterate as your dataset grows.

Stop guessing if your LLM is working. Start measuring it.