Mastering EDD: Building Resilient RAG Systems with Arize Phoenix and Giskard

In traditional software engineering, we have a clear contract: for a given input, a function should return a predictable output. We write unit tests, set up CI/CD pipelines, and sleep soundly. In the world of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG), that contract is broken. LLMs are non-deterministic, and the 'retrieval' part of RAG introduces a whole new layer of failure points—from poor document chunking to semantic drift.

Most teams start with 'vibes-based development.' They tweak a prompt, run a few queries, and if the response 'looks good,' they ship it. This is the equivalent of 'it works on my machine' but for the generative AI era. To build production-grade RAG systems, we need to move toward Evaluation-Driven Development (EDD).

This article explores how to implement EDD by combining two powerful open-source tools: Arize Phoenix for observability and evaluation, and Giskard for automated vulnerability scanning and unit testing.

The Shift to Evaluation-Driven Development (EDD)

Evaluation-Driven Development is the LLM-native successor to Test-Driven Development (TDD). In TDD, you write a test before the code. In EDD, you define your evaluation metrics and 'Golden Datasets' before you finalize your RAG architecture.
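
As a rough illustration, a Golden Dataset and its pass criteria can be written down before any retrieval code exists. The sketch below is an assumption about how such a record might look, not a format required by Phoenix or Giskard; the names GoldenExample, relevant_doc_ids, and THRESHOLDS, the document ID, and the threshold values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenExample:
    """One hand-verified example used as ground truth in evaluations."""
    question: str
    reference_answer: str
    # IDs of the documents a correct retriever should return (hypothetical scheme)
    relevant_doc_ids: tuple = ()
    # Free-form labels such as "pricing" or "account", useful for slicing results
    tags: tuple = ()


# Minimum scores the pipeline must reach on the golden dataset (illustrative values)
THRESHOLDS = {"faithfulness": 0.8, "answer_relevancy": 0.9}

golden_dataset = [
    GoldenExample(
        question="How do I reset my password?",
        reference_answer="Open Settings, choose Security, then select Reset password.",
        relevant_doc_ids=("kb-042",),
        tags=("account",),
    ),
]
```

Writing the thresholds next to the data keeps the definition of "good enough" under version control alongside the examples it applies to.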

EDD focuses on three pillars:

Retrieval Quality: Is the system finding the right context?
Generation Quality: Is the LLM using that context accurately (Faithfulness)?
Robustness: Does the system handle edge cases and adversarial inputs without hallucinating?

By using Arize Phoenix and Giskard, we can automate these pillars and turn subjective quality into objective data.

Setting the Foundation: Instrumentation with Arize Phoenix

Before you can evaluate a system, you must be able to see inside it. Arize Phoenix is an open-source observability library that provides OpenTelemetry-based tracing for LLM applications. It allows you to visualize your RAG pipeline—from the initial query to the vector database retrieval, and finally to the LLM response.

Why Tracing Matters

A RAG failure isn't always an LLM failure. Sometimes the retriever returns irrelevant documents (low context precision), or the LLM ignores the correct context (low faithfulness). Without tracing, you are debugging a black box.
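
The distinction above can be turned into a small triage rule once each trace carries a retrieval score and a faithfulness score. This is a minimal sketch under assumed inputs: the function name diagnose, the 0-to-1 score scale, and the 0.5 cutoff are illustrative choices, not part of either tool.

```python
def diagnose(context_relevance: float, faithfulness: float, threshold: float = 0.5) -> str:
    """Attribute a bad answer to the retriever, the generator, or neither.

    Both scores are assumed to lie in [0, 1], with higher meaning better.
    """
    # Check retrieval first: if the context was irrelevant, the generator
    # never had a fair chance to produce a grounded answer.
    if context_relevance < threshold:
        return "retrieval"
    # The context was good, so an unsupported answer points at the LLM or prompt.
    if faithfulness < threshold:
        return "generation"
    return "ok"
```

Checking retrieval before generation reflects the order of the pipeline: a generation problem can only be judged fairly once the retrieved context is known to be relevant.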

To instrument a LangChain or LlamaIndex application with Phoenix, you simply need to initialize the tracer:

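import phoenix as px
from phoenix.otel import register
# Recent Phoenix releases use the separate openinference instrumentation packages;
# older releases exposed the instrumentor under phoenix.trace.langchain instead.
from openinference.instrumentation.langchain import LangChainInstrumentor

# Launch the Phoenix server locally
session = px.launch_app()

# Register an OpenTelemetry tracer provider that exports spans to Phoenix
tracer_provider = register()

# Instrument your framework
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)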

Once instrumented, every query you run is captured as a 'trace.' This data becomes the raw material for your evaluation suite.

Automated Vulnerability Scanning with Giskard

While Phoenix helps us observe and evaluate historical traces, Giskard acts as our proactive 'QA Engineer.' Giskard is a testing framework that scans LLM applications for hidden risks such as hallucinations, misinformation, bias, and harmful content.

Instead of manually thinking of edge cases, Giskard’s Scanner uses its own 'LLM-as-a-judge' to probe your RAG system. It attempts to find inputs that break your logic.

Implementing a RAG Scan

To use Giskard, you wrap your RAG function into a Giskard Model and run a scan. This is particularly effective for identifying 'Faithfulness' issues—where the model generates an answer that sounds confident but isn't supported by the retrieved documents.

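import giskard
import pandas as pd

# rag_chain is assumed to be your existing chain: it takes a question string
# and returns the answer as a string.
def model_predict(df: pd.DataFrame) -> list:
    """Giskard passes a DataFrame of inputs and expects one output per row."""
    return [rag_chain.invoke(question) for question in df["question"]]

giskard_model = giskard.Model(
    model=model_predict,
    model_type="text_generation",  # the model type Giskard uses for LLM applications
    name="RAG_Knowledge_Base",
    # The scanner uses the description to generate domain-specific probes
    description="Answers technical questions based on internal documentation",
    feature_names=["question"]
)

# Run the automated scan (the LLM-assisted detectors need an LLM API key configured)
report = giskard.scan(giskard_model)
report.to_html("scan_report.html")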

The resulting report highlights specific 'slices' of data where the model fails. For example, it might find that the model consistently hallucinates when asked about 'pricing' because the retriever isn't pulling the correct fee schedule documents.

The Core of EDD: Defining RAG Metrics

To automate your unit tests, you need quantitative metrics. A widely used framing is the 'RAG Triad': one metric for grounding, one for answer relevance, and one for retrieval quality. Both Phoenix and Giskard can compute metrics of this kind using an LLM-as-a-judge (typically a stronger model like GPT-4o or Claude 3.5 Sonnet).

1. Faithfulness (Groundedness)

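Definition: Is every claim in the answer supported by the retrieved context? The Test: If the context says "The sky is green" and the model says "The sky is blue," the faithfulness score is 0, even though the answer is true in the real world. Faithfulness measures grounding in the context, not factual correctness, and it is the primary defense against hallucinations.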
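
A minimal sketch of the scoring idea, assuming the answer has already been split into individual claims. The names faithfulness_score and toy_judge are hypothetical, and the substring check is only a stand-in for the LLM judge a real evaluator would call.

```python
from typing import Callable, List


def faithfulness_score(
    claims: List[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of claims in the answer that the judge finds supported by the context."""
    if not claims:
        # Convention chosen for this sketch: an answer that asserts nothing
        # contains nothing unsupported.
        return 1.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def toy_judge(claim: str, context: str) -> bool:
    """Toy stand-in for an LLM judge: case-insensitive substring match."""
    return claim.lower() in context.lower()


context = "The sky is green. Grass is blue."
score_wrong = faithfulness_score(["The sky is blue"], context, toy_judge)
score_mixed = faithfulness_score(["The sky is green", "Water is wet"], context, toy_judge)
```

Passing the judge in as a function keeps the scoring logic testable offline; in practice the judge would be an LLM call, which is slower and not perfectly repeatable.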

2. Answer Relevancy

Definition: Does the answer actually address the user's question? The Test: If the user asks "How do I reset my password?" and the model explains the history of passwords without providing instructions, the relevancy is low.

3. Context Precision

Definition: Are the relevant documents ranked highly in the retrieval results? The Test: If the 'golden' document is ranked 10th in the search results, the model is less likely to use it effectively than if it were ranked 1st.
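
One common way to put a number on this is an average-precision-style score over the ranked results. The sketch below assumes you already have a relevance label for each retrieved chunk, in rank order; the function name and input format are illustrative, and the exact formula used by a given evaluation library may differ.

```python
from typing import List


def context_precision(relevance: List[bool]) -> float:
    """Average of precision@k taken at each rank k that holds a relevant chunk.

    `relevance[i]` is True when the chunk at rank i + 1 is relevant.
    Returns 0.0 when no relevant chunk was retrieved.
    """
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision@rank at this relevant position
    return total / hits if hits else 0.0


golden_first = context_precision([True] + [False] * 9)
golden_last = context_precision([False] * 9 + [True])
```

With a single relevant document, the score is 1 divided by its rank, so moving the golden document from rank 1 to rank 10 drops the score from 1.0 to 0.1.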

Building Automated Unit Tests for RAG

Once you have your metrics defined and your scanner has identified potential weaknesses, you can codify these into automated unit tests. This is where we move from experimentation to engineering.

Using Giskard, you can turn scan findings into a 'Test Suite' that runs in your CI pipeline. If a code change (like switching from text-davinci to gpt-4o-mini) causes faithfulness to drop below a threshold, the build fails.

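# Turn the scan findings into a suite: one regression test per detected issue
test_suite = report.generate_test_suite("RAG regression suite")

# Add an LLM-judged faithfulness check over the golden dataset.
# test_dataset is assumed to be a giskard.Dataset with a "question" column.
test_suite.add_test(
    giskard.testing.test_llm_output_against_requirement(
        model=giskard_model,
        dataset=test_dataset,
        requirement="The answer must be supported by the internal documentation and must not invent facts.",
    )
)

# Execute the suite and fail the CI job if any test fails
results = test_suite.run()
assert results.passed, "RAG test suite failed"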
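
Independently of any particular test framework, the CI gate itself is a threshold comparison. The following is a minimal sketch with hypothetical function names and illustrative scores; in a real pipeline the scores would come from your evaluation run.

```python
from typing import Dict, List


def failed_metrics(scores: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the metrics whose score is below its threshold.

    A metric that is missing from `scores` counts as a failure.
    """
    return [
        name
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    ]


def gate(scores: Dict[str, float], thresholds: Dict[str, float]) -> int:
    """Print each failing metric and return a process exit code (0 = pass)."""
    failures = failed_metrics(scores, thresholds)
    for name in failures:
        print(f"FAIL {name}: {scores.get(name)} < {thresholds[name]}")
    return 1 if failures else 0


thresholds = {"faithfulness": 0.8, "answer_relevancy": 0.9}
exit_code = gate({"faithfulness": 0.76, "answer_relevancy": 0.93}, thresholds)
```

Passing the return value to sys.exit in a CI script makes the job fail whenever a metric regresses, which is the behavior described above.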

Combining the Duo: The Feedback Loop

The real power comes from combining the two tools in a single feedback loop. A typical workflow looks like this:

Development Phase: Use Arize Phoenix locally to trace your RAG chain. Inspect individual spans to ensure your retriever is fetching the right chunks and your prompt templates are being populated correctly.
Discovery Phase: Run a Giskard Scan against your development environment. Identify 'hallucination hotspots'—topics or query types where the model struggles.
Curating the Golden Set: Take the failure cases identified by Giskard and the successful traces from Phoenix and combine them into a 'Golden Dataset.' This is your ground truth.
CI/CD Integration: Wrap the Golden Set in a Giskard Test Suite. Every PR that modifies the prompt, the embedding model, or the retrieval logic must pass this suite.
Production Monitoring: Once deployed, use Phoenix in production to export live traces. If you see a cluster of low evaluation scores in production, feed those queries back into Giskard to generate new unit tests.
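
The curation and feedback steps above amount to merging new cases into the Golden Dataset without duplicating questions it already covers. This is a minimal sketch that assumes each case is a plain dictionary with a "question" key; the function names and the "source" field are hypothetical.

```python
from typing import Dict, Iterable, List


def normalize(question: str) -> str:
    """Lowercase and collapse whitespace so near-identical questions compare equal."""
    return " ".join(question.lower().split())


def merge_into_golden_set(
    golden: List[Dict[str, str]],
    candidates: Iterable[Dict[str, str]],
) -> List[Dict[str, str]]:
    """Return a new list: the golden set plus candidates whose question is new."""
    seen = {normalize(row["question"]) for row in golden}
    merged = list(golden)
    for row in candidates:
        key = normalize(row["question"])
        if key not in seen:
            seen.add(key)
            merged.append(row)
    return merged


golden = [{"question": "How do I reset my password?", "source": "phoenix_trace"}]
candidates = [
    {"question": "how do I  reset my password?", "source": "production"},
    {"question": "What is the enterprise price?", "source": "giskard_scan"},
]
merged = merge_into_golden_set(golden, candidates)
```

Exact-match deduplication is deliberately simple here; paraphrased duplicates would need embedding similarity or manual review.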

Solving the Hallucination Problem

Hallucinations in RAG usually stem from two sources: 'Missing Information' (The retriever failed) or 'Information Overload' (The retriever found too much noise, and the LLM got confused).

By using Phoenix's UMAP (Uniform Manifold Approximation and Projection) visualizations, you can see 'clusters' of queries that result in low faithfulness. If these queries are clustered in a region of your vector space where you have no data, you know you have a retrieval gap. If they are scattered, you likely have a prompt engineering or model reasoning issue. Giskard then allows you to create specific 'adversarial' tests for those clusters to verify that, when the data is missing, the system says "I don't know" rather than making up an answer.
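
An abstention check of this kind can be sketched without either library. The code below assumes a list of questions known to have no answer in the knowledge base and measures how often the system declines to answer. The marker phrases, function names, and the fake_rag stand-in are all hypothetical, and phrase matching is a crude proxy for what an LLM judge would decide.

```python
from typing import Callable, List

# Phrases treated as an explicit refusal to answer (illustrative list)
ABSTAIN_MARKERS = ("i don't know", "i do not know", "not enough information")


def abstains(answer: str) -> bool:
    """True when the answer contains one of the abstention phrases."""
    text = answer.lower()
    return any(marker in text for marker in ABSTAIN_MARKERS)


def abstention_rate(answer_fn: Callable[[str], str], unanswerable: List[str]) -> float:
    """Share of unanswerable questions for which the system abstains."""
    if not unanswerable:
        return 1.0  # nothing to get wrong
    hits = sum(1 for question in unanswerable if abstains(answer_fn(question)))
    return hits / len(unanswerable)


def fake_rag(question: str) -> str:
    """Stand-in for a real RAG pipeline that hallucinates on pricing questions."""
    if "pricing" in question.lower():
        return "The enterprise plan costs $99 per month."
    return "I don't know based on the available documentation."


rate = abstention_rate(
    fake_rag,
    ["What is the pricing for Mars delivery?", "Who founded the office cafeteria?"],
)
```

A rate below 1.0 on a cluster of unanswerable questions is a signal to tighten the prompt or add a retrieval-confidence check before generation.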

Actionable Conclusion

Transitioning from 'vibes' to 'evals' is the single most important step in maturing an AI engineering practice. To get started:

Stop manual testing: Install the arize-phoenix and giskard packages today.
Instrument immediately: Add Phoenix tracing to your local dev environment to see what your RAG system is actually doing.
Generate a Baseline: Run a Giskard scan on your current system to find your 'hallucination rate.'
Automate the Gate: Create a Golden Dataset of 50 to 100 question-context-answer triplets as a starting point and integrate a Giskard Test Suite into your CI/CD pipeline.

By treating LLM outputs as measurable data rather than magic, you build systems that aren't just impressive in a demo, but reliable in production.