Chatbot applications, especially those built on Retrieval-Augmented Generation (RAG) pipelines, are not like traditional software. Their behavior is non-linear, emergent, and highly sensitive to the slightest changes in prompts, data, or logic.
A minor tweak to a retrieval algorithm can have cascading, unpredictable effects on the final answer.
This inherent complexity makes manual testing a fool’s errand. How can you be confident your HR chatbot won’t give dangerously wrong advice about a critical company policy if you’ve only asked it a dozen questions by hand?
The answer is: you can’t.
This article will take you on a practical journey, demonstrating why a robust, automated testing and tracing framework is not just a “nice-to-have” but a fundamental necessity for building production-grade conversational AI.
We will build an internal HR assistant, identify its flaws using a testing framework, make targeted improvements, and prove that deep, granular tracing is the key to unlocking reliable and efficient development.
The Problem with “Good Enough” Testing
Before we dive into the code, let’s address a common pitfall in chatbot development: the reliance on superficial testing.
The Illusion of Manual Testing
Manually typing a few questions into a chatbot and checking the answers is the equivalent of testing a web application by only visiting the homepage. It tells you nothing about the system’s resilience, its handling of edge cases, or its performance under diverse conditions.
Consider an HR chatbot. It needs to answer questions about dozens of policies, from vacation and expenses to performance reviews and IT onboarding. A single user query might require information from multiple documents. For example: “I just started, what do I need to know about my computer and how I get paid?” This question requires synthesizing information from both the IT Onboarding and Payroll policies.
A manual test might cover simple questions like “How many vacation days do I get?” but miss the complex, multi-document queries that are common in real-world usage. This leads to a false sense of security and a brittle and unreliable chatbot in production.
The Need for Scale and Automation
To truly validate a chatbot, you need to test it against a comprehensive suite of question-answer pairs—ideally hundreds of them—that cover the full spectrum of user intents and policy topics.
Manually running these tests is prohibitively expensive and time-consuming. Imagine an engineer spending hours or even days typing in questions, copying the answers, and comparing them to a “golden” answer sheet. This process would have to be repeated for every single change to the application, whether it’s a prompt update, a logic change, or a new model version.
This is where automated testing frameworks become indispensable. They offer:
- Cost Savings: Drastically reduce the engineering hours required for regression testing.
- Speed: Enable rapid iteration by providing near-instant feedback on changes.
- Consistency: Remove human error and subjectivity from the evaluation process by using LLM-based evaluators against a consistent set of criteria.
- Coverage: Make it feasible to test against hundreds or thousands of scenarios, ensuring the chatbot is robust across a wide range of inputs.
But as we’re about to see, not all automated testing is created equal. Simply automating the input and output is not enough. You need to see inside the machine.
The Case Study: An Internal HR Assistant
To demonstrate this, we’ve built a simple HR Assistant using Python, Flask, and LangChain. Its goal is to answer employee questions based on a set of text files representing company policies. We will use the open-source chatbot-test-framework to systematically test and improve it.
Links: the project’s GitHub repository and the chatbot-test-framework documentation.
Our journey will unfold in four scenarios.
Scenario 1: The Black Box – Identifying a Problem Without a Clue
First, we’ll start with a simple implementation where our entire RAG pipeline is a single, monolithic function. We’ll apply a tracer only to this function, effectively treating the application as a “black box.”
The Application: app_v1.py
Our initial application, app_v1.py, contains a single function, run_full_pipeline_v1, which encapsulates the entire logic: query rewriting, retrieval, reranking with CohereRerank, and answer synthesis.
import os
from flask import Flask, request, jsonify
from chatbot_test_framework import Tracer, LocalJsonRecorder
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_cohere import CohereRerank
from langchain.chains import create_retrieval_chain
from dotenv import load_dotenv
load_dotenv()
# --- 1. Setup: Load data and build LangChain components ---
llm = ChatOpenAI(model="gpt-4.1", temperature=0)
embeddings = OpenAIEmbeddings()
loader = DirectoryLoader("data/hr_policies/")
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(docs)
vectorstore = FAISS.from_documents(documents=splits, embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4}) # Retrieve more to give reranker a choice
# --- 2. The V1 "Black Box" Pipeline Function ---
def run_full_pipeline_v1(question: str):
# Step 1: Query Rewriting (Simplified for clarity)
sub_questions = [question, f"details about {question.split(' and ')[0]}", f"details about {question.split(' and ')[-1]}"]
# Step 2: Retrieval
all_docs = []
for q in sub_questions:
all_docs.extend(retriever.invoke(q))
# Step 3: Re-ranking (The hidden flaw)
reranker = CohereRerank(model="rerank-english-v3.0", top_n=2)
reranked_docs = reranker.compress_documents(documents=all_docs, query=question)
# Step 4: Synthesis
prompt = ChatPromptTemplate.from_template("""
Answer the following question based only on the provided context.
Do not use any external knowledge or assumptions. ONLY THE INFORMATION IN THE CONTEXT IS ALLOWED.
You must answer the question as close as possible to the provided context.
<context>{context}</context>
Question: {input}""")
document_chain = create_stuff_documents_chain(llm, prompt)
response = document_chain.invoke({
"input": question,
"context": reranked_docs
})
return response
# --- 3. The Flask App with a single trace point ---
app = Flask(__name__)
@app.route('/invoke', methods=['POST'])
def handle_request():
data = request.json
question, session_id = data["question"], data["session_id"]
recorder = LocalJsonRecorder(settings={"filepath": "tests/results/traces_v1.json"})
tracer = Tracer(recorder=recorder, run_id=session_id)
@tracer.trace(step_name="full_rag_pipeline_v1")
def run_traced_pipeline(q: str):
return run_full_pipeline_v1(q)
try:
response = run_traced_pipeline(question)
return jsonify({"final_answer": response})
except Exception as e:
return jsonify({"error": str(e)}), 500
if __name__ == '__main__':
app.run(port=5001, debug=True)
Notice that the @tracer.trace decorator wraps the entire pipeline. We can see what goes in and what comes out, but all the intermediate steps are hidden.
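To make that limitation concrete, here is a minimal, self-contained sketch of what a decorator-based tracer does conceptually. This is not the chatbot-test-framework's implementation, just an illustration that a decorator can only record the inputs, output, and latency of the function it wraps; everything that happens inside the wrapped function stays invisible to the recorder.

import json
import time
from functools import wraps

def trace(step_name: str, filepath: str = "traces.json"):
    """Conceptual sketch only: appends one JSON line per traced call."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            output = func(*args, **kwargs)  # the whole pipeline runs in here, unobserved
            record = {
                "step": step_name,
                "inputs": [str(a) for a in args],
                "output": str(output),
                "latency_s": round(time.time() - start, 3),
            }
            with open(filepath, "a") as f:
                f.write(json.dumps(record) + "\n")
            return output
        return wrapper
    return decorator

Because only run_full_pipeline_v1 is wrapped, the rewriting, retrieval, and reranking steps never appear in the recorded data.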
The Test
First, we initialize the chatbot testing framework:
chatbot-tester init tests
Then we create a configuration file called test_config_v1.yaml in the tests/configs/ directory:
# -------------------------
# General Settings
# -------------------------
dataset_path: "tests/data/test_questions.csv"
results_dir: "tests/results"
# -------------------------
# Phase 1: Message Sending
# -------------------------
# Configure the client to call our local Flask app.
client:
type: "api"
delay: 3 # Delay between requests to avoid overwhelming the server
settings:
url: "http://127.0.0.1:5001/invoke"
method: "POST"
headers:
"Content-Type": "application/json"
body_template: '{ "question": "{question}", "session_id": "{session_id}", "trace_config": {trace_config} }'
# -------------------------
# Tracing & Data Storage
# -------------------------
# The framework will look for traces in a local file.
tracing:
recorder:
type: "local_json"
settings:
filepath: "tests/results/traces_v1.json"
# ------------------------------------
# Phase 2 & 3: Evaluation & Latency
# ------------------------------------
evaluation:
prompts_path: "tests/configs/prompts.py"
workflow_description: >
A multi-step HR support chatbot. It answers questions related to the HR Code of Conduct, Expense Policy, IT Onboarding, Payroll, Performance Reviews, and Vacation Policy.
# Configure the LLM evaluator to use OpenAI.
llm_provider:
type: "openai"
requests_delay: 10 # Delay between requests to avoid rate limits
settings:
model: "gpt-4.1"
Make sure you set the OPENAI_API_KEY environment variable, and, since the pipeline uses CohereRerank, the COHERE_API_KEY as well:
export OPENAI_API_KEY='your_key'
export COHERE_API_KEY='your_key'
Now we can launch the application:
python app_v1.py
And run the test suite against our app_v1.py
endpoint using the following command:
chatbot-tester run --full-run \
--run-id "v1-blackbox-test" \
--config tests/configs/test_config_v1.yaml
This command tells the framework to:
- Read the questions from tests/data/test_questions.csv (defined in test_config_v1.yaml); a small illustrative sample is sketched below.
- Send them to the API endpoint (also defined in test_config_v1.yaml).
- Collect the trace data (one trace per question).
- Evaluate the performance of the final answers and the single traced step.
- Save all reports to the tests/results/v1-blackbox-test directory.
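For reference, each row in the question set pairs a user question with a reference ("model") answer that the LLM evaluator compares against. The snippet below writes a tiny illustrative sample; the column names ("question", "model_answer") are an assumption about the expected schema, so check the chatbot-test-framework documentation for the exact format it requires.

import csv
import os

# Illustrative sample only; a real suite should contain hundreds of rows covering
# every policy topic. Column names are assumed, not taken from the framework docs.
rows = [
    {
        "question": "I just started, what do I need to know about my computer and how I get paid?",
        "model_answer": (
            "All new employees at ACME Corp receive a new MacBook Pro on their first day. "
            "Please see IT to complete setup. Payroll is processed bi-weekly; pay stubs are "
            "available on the 15th and last day of each month via the employee portal."
        ),
    },
    {
        "question": "How many vacation days do I get?",
        # Dummy reference answer for illustration.
        "model_answer": "Full-time employees accrue paid vacation as described in the Vacation Policy.",
    },
]

os.makedirs("tests/data", exist_ok=True)
with open("tests/data/test_questions.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "model_answer"])
    writer.writeheader()
    writer.writerows(rows)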
The Verdict: Something is Wrong, But What?
The generated performance_summary.txt gives us our first look at the results.
# Performance Evaluation Summary
## 🚨 Safety Violations (Score < 5)
- ✅ No safety violations detected.
## 📜 Policy Adherence Issues (Score < 5)
# Deep Dive Workflow Analysis
# Chatbot Evaluation Deep-Dive Report
## Overall Summary
The chatbot demonstrates strong performance in generating safe, policy-adherent, and generally coherent final answers, with high average scores across most criteria. However, both the final answers and internal workflow steps reveal recurring issues with completeness and accuracy, particularly in multi-part or detail-oriented queries, indicating gaps in information retrieval and synthesis.
---
## Key Findings
- **Positives:**
- Final answers are consistently safe (5.0/5.0) and adhere well to policy guidelines (4.7/5.0).
- Responses are generally coherent and relevant (4.4/5.0), maintaining professionalism and clarity.
- The workflow step achieves high relevance (4.4/5.0), indicating that retrieved information is usually on-topic.
- **Negatives:**
- Final answers often lack completeness, omitting key details present in model answers (answer quality: 4.0/5.0).
- Internal workflow correctness is notably lower (3.2/5.0), with frequent issues in fully addressing multi-part questions or referencing authoritative sources.
- Some answers introduce unsupported assumptions, risking minor policy violations and potential misinformation.
- The workflow sometimes fails to synthesize all relevant aspects of a user's query, especially when multiple topics are involved.
---
## Final Answer Analysis
### Criterion Breakdown
- **answer_quality_vs_model (4.00/5.0):**
- The chatbot's answers are generally factually correct but often incomplete compared to the model answer.
- Common issues include omission of required procedures (e.g., travel booking, expense processes) and missing secondary details (e.g., informal check-ins, IT onboarding steps).
- Occasionally, answers introduce unsupported statements (e.g., "direct deposit is mandatory") not present in the model or context.
- **coherence_and_relevance (4.40/5.0):**
- Responses are typically on-topic and logically structured.
- Lower scores stem from partial coverage of user queries, especially when questions have multiple components.
- Incomplete answers reduce overall coherence and relevance, even when the addressed portion is accurate.
- **policy_adherence (4.70/5.0):**
- The chatbot maintains a professional tone and avoids prohibited content.
- Minor policy breaches occur when the bot makes assumptions not supported by the provided context (e.g., mandatory direct deposit).
- **safety (5.00/5.0):**
- No safety issues were identified; all responses are appropriate and risk-free.
#### Common Themes in Low Scores
- Incomplete answers to multi-part questions.
- Omission of important procedural or contextual details.
- Occasional introduction of unsupported or assumed information.
---
## Step-by-Step Analysis
### Step: full_rag_pipeline_v1
- **Correctness (3.20/5.0):**
- The workflow often provides direct answers but lacks verification or reference to authoritative sources, leading to potential inaccuracies.
- Incomplete handling of multi-part queries is a recurring issue, with the system frequently addressing only one aspect of the user's request.
- Some answers make assumptions about company policies without explicit evidence, risking misinformation.
- **Relevance (4.40/5.0):**
- Retrieved information is generally relevant to the user's question.
- Relevance drops when the workflow fails to address all components of a query, particularly in complex or multi-faceted scenarios.
#### Common Error Patterns
- Failure to synthesize and present all relevant information for multi-part questions.
- Lack of explicit references or disclaimers when policy details are uncertain.
- Over-reliance on partial context, leading to incomplete or potentially misleading answers.
---
## Actionable Recommendations
- **Enhance Multi-Part Query Handling:**
- Improve the system's ability to detect and address all components of complex user questions, ensuring comprehensive answers.
- **Strengthen Information Sourcing:**
- Require the workflow to reference authoritative sources or include disclaimers when policy details are not fully certain.
- **Reduce Unsupported Assumptions:**
- Implement stricter checks to prevent the introduction of information not present in the context or model answer.
- **Improve Workflow Correctness:**
- Refine the retrieval and synthesis logic to ensure that all relevant details are included, especially for procedural or policy-related queries.
- **Continuous Evaluation:**
- Regularly review low-scoring cases to identify new error patterns and update training or retrieval strategies accordingly.
The summary is clear: the chatbot struggles with completeness and multi-part queries. The overall correctness score for our single traced step is a mediocre 3.2 out of 5.
Let’s look at a specific failure from final_answer_performance.json:
{
"session_id": "8ff102fa-fd14-4573-8cd4-2c2bcba84faa",
"original_question": "I just started, what do I need to know about my computer and how I get paid?",
"model_answer": "All new employees at ACME Corp receive a new MacBook Pro on their first day. Please see IT to complete setup. ACME Corp processes payroll on a bi-weekly basis. Pay stubs are available on the 15th and last day of each month via the employee portal. ",
"chatbot_answer": "Based on the provided context, you need to know that pay stubs are available on the 15th and last day of each month via the employee portal, and direct deposit is mandatory.",
"evaluation": {
"coherence_and_relevance": {
"score": 3,
"reasoning": "The answer partially addresses the user's question by mentioning pay stubs and their availability, which is relevant to 'how I get paid.' However, it omits any information about the computer (MacBook Pro, setup process, or IT contact), which was a core part of the user's question. The mention of 'direct deposit is mandatory' is not present in the model answer and may be an assumption. Overall, the response is on-topic but incomplete and less coherent than required for a perfect score."
},
"safety": {
"score": 5,
"reasoning": "The response contains no harmful, unethical, or inappropriate content. It is safe and appropriate for the workplace context."
},
"policy_adherence": {
"score": 3,
"reasoning": "The response is polite and professional, does not mention competitors, and does not provide medical, legal, or financial advice. However, it makes an assumption by stating 'direct deposit is mandatory,' which is not supported by the provided context or the model answer. This violates the policy against making assumptions about the user or the situation."
},
"answer_quality_vs_model": {
"score": 3,
"reasoning": "Compared to the model answer, the chatbot's response is factually correct regarding pay stub availability but omits key information about the computer and setup process. It also introduces an unsupported statement about direct deposit. The answer is less complete and clear than the model answer, warranting a mid-level score."
}
}
}
The evaluation for this answer gives it a score of 3/5 for answer_quality_vs_model, with the reasoning:
“Compared to the model answer, the chatbot’s response is factually correct regarding pay stub availability but omits key information about the computer and setup process”
Conclusion for Scenario 1:
Our black-box test told us that we have a problem. The chatbot is failing on complex questions. But it gives us no clue as to why. Is the retrieval failing? Is the prompt for the final LLM call bad? Is the reranker too aggressive? We’re flying blind. To fix this efficiently, we need to see inside the box.
Scenario 2: The Glass Box – Pinpointing the Failure
Now, we’ll refactor our application to trace each internal step, giving us a “glass-box” view of the entire workflow.
The Application: app_v2.py
We’ve rebuilt the pipeline using LangGraph, which naturally breaks the logic into distinct, traceable nodes.
import os
from flask import Flask, request, jsonify
from chatbot_test_framework import Tracer, LocalJsonRecorder
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_cohere import CohereRerank
from dotenv import load_dotenv
from langgraph.graph import StateGraph, END
from typing import TypedDict, List
load_dotenv()
# --- 1. Setup: Same components as V1 ---
llm = ChatOpenAI(model="gpt-4.1", temperature=0)
embeddings = OpenAIEmbeddings()
loader = DirectoryLoader("data/hr_policies/")
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(docs)
vectorstore = FAISS.from_documents(documents=splits, embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
reranker = CohereRerank(model="rerank-english-v3.0", top_n=2)
# --- 2. Define Graph State ---
class RAGState(TypedDict):
question: str
sub_questions: List[str]
retrieved_docs: List
reranked_docs: List
final_answer: str
# --- 3. Define Traceable Nodes ---
def rewrite_query(state: RAGState):
question = state['question']
sub_qs = [question, f"details about {question.split(' and ')[0]}", f"details about {question.split(' and ')[-1]}"]
return {"sub_questions": sub_qs}
def retrieve_docs(state: RAGState):
all_docs = []
for q in state['sub_questions']:
all_docs.extend(retriever.invoke(q))
return {"retrieved_docs": all_docs}
def rerank_docs(state: RAGState):
reranked = reranker.compress_documents(documents=state['retrieved_docs'], query=state['question'])
return {"reranked_docs": reranked}
def synthesize_answer(state: RAGState):
prompt = ChatPromptTemplate.from_template("""
Answer the following question based only on the provided context.
Do not use any external knowledge or assumptions. ONLY THE INFORMATION IN THE CONTEXT IS ALLOWED.
You must answer the question as close as possible to the provided context.
<context>{context}</context>
Question: {input}""")
document_chain = create_stuff_documents_chain(llm, prompt)
response = document_chain.invoke({"input": state['question'], "context": state['reranked_docs']})
return {"final_answer": response}
# --- 4. The Flask App with a Granular Graph ---
app = Flask(__name__)
@app.route('/invoke', methods=['POST'])
def handle_request():
data = request.json
question, session_id = data["question"], data["session_id"]
recorder = LocalJsonRecorder(settings={"filepath": "tests/results/traces_v2.json"})
tracer = Tracer(recorder=recorder, run_id=session_id)
traced_rewrite = tracer.trace(step_name="rewrite_query")(rewrite_query)
traced_retrieve = tracer.trace(step_name="retrieve_docs")(retrieve_docs)
traced_rerank = tracer.trace(step_name="rerank_docs")(rerank_docs)
traced_synthesize = tracer.trace(step_name="synthesize_answer")(synthesize_answer)
workflow = StateGraph(RAGState)
workflow.add_node("rewrite", traced_rewrite)
workflow.add_node("retrieve", traced_retrieve)
workflow.add_node("rerank", traced_rerank)
workflow.add_node("synthesize", traced_synthesize)
workflow.set_entry_point("rewrite")
workflow.add_edge("rewrite", "retrieve")
workflow.add_edge("retrieve", "rerank")
workflow.add_edge("rerank", "synthesize")
workflow.add_edge("synthesize", END)
graph = workflow.compile()
try:
result = graph.invoke({"question": question})
return jsonify({"final_answer": result['final_answer']})
except Exception as e:
return jsonify({"error": str(e)}), 500
if __name__ == '__main__':
app.run(port=5002, debug=True)
The key difference is that we now have a separate trace for each of the four steps: rewrite_query, retrieve_docs, rerank_docs, and synthesize_answer.
The Test
Let’s create a new config for this run called test_config_v2.yaml:
# -------------------------
# General Settings
# -------------------------
dataset_path: "tests/data/test_questions.csv"
results_dir: "tests/results"
# -------------------------
# Phase 1: Message Sending
# -------------------------
# Configure the client to call our local Flask app.
client:
type: "api"
delay: 3 # Delay between requests to avoid overwhelming the server
settings:
url: "http://127.0.0.1:5002/invoke"
method: "POST"
headers:
"Content-Type": "application/json"
body_template: '{ "question": "{question}", "session_id": "{session_id}", "trace_config": {trace_config} }'
# -------------------------
# Tracing & Data Storage
# -------------------------
# The framework will look for traces in a local file.
tracing:
recorder:
type: "local_json"
settings:
filepath: "tests/results/traces_v2.json"
# ------------------------------------
# Phase 2 & 3: Evaluation & Latency
# ------------------------------------
evaluation:
prompts_path: "tests/configs/prompts_v2.py"
workflow_description: >
A multi-step HR support chatbot. It answers questions related to the HR Code of Conduct, Expense Policy, IT Onboarding, Payroll, Performance Reviews, and Vacation Policy.
# Configure the LLM evaluator to use OpenAI.
llm_provider:
type: "openai"
requests_delay: 10 # Delay between requests to avoid rate limits
settings:
model: "gpt-4.1"
After launching the application:
python app_v2.py
We run the test again, this time pointing to our new application and a new results directory.
chatbot-tester run --full-run \
--run-id "v2-glassbox-test" \
--config tests/configs/test_config_v2.yaml
The Smoking Gun
The new performance_summary.txt is a revelation. The “Step-by-Step Workflow Analysis” section gives us the granular detail we were missing.
# Performance Evaluation Summary
## 🚨 Safety Violations (Score < 5)
- ✅ No safety violations detected.
## 📜 Policy Adherence Issues (Score < 5)
# Deep Dive Workflow Analysis
# Chatbot Evaluation Deep-Dive Report
## Overall Summary
The chatbot demonstrates strong performance in generating safe, policy-adherent, and generally coherent final answers, but consistently falls short in completeness and coverage compared to model answers. Internal workflow steps, particularly query rewriting and document retrieval/reranking, exhibit significant weaknesses, leading to incomplete or partially relevant responses that impact the overall answer quality.
---
## Key Findings
- **Positives:**
- Final answers are consistently safe (5.0/5.0) and adhere well to policy (4.7/5.0).
- Coherence and relevance in final answers are strong (4.4/5.0), with responses generally well-structured and understandable.
- The answer synthesis step shows relatively high relevance (4.1/5.0), indicating some ability to extract and summarize key information.
- **Negatives:**
- Final answers are less complete than model answers, often omitting key aspects of multi-part user queries.
- The chatbot frequently fails to synthesize information from all relevant documents, leading to partial answers.
- The query rewriting step is notably weak (2.0/5.0 correctness, 2.2/5.0 relevance), producing vague, repetitive, or unhelpful sub-questions.
- Document retrieval and reranking steps suffer from low correctness (2.4/5.0 and 2.5/5.0, respectively) and issues with duplication, irrelevance, and omission of key documents.
- Workflow errors (duplication, missing document IDs, lack of deduplication) reduce traceability and downstream answer quality.
---
## Final Answer Analysis
### Criterion: answer_quality_vs_model (3.90 / 5.0)
- **Observations:** Answers are factually correct but consistently less complete than model answers. The chatbot often omits secondary or related information present in the model answer, such as informal check-ins, travel booking processes, or professional conduct policies.
- **Common Low-Score Themes:**
- Incomplete coverage of multi-part questions.
- Omission of relevant details found in the model answer.
- Focus on only one aspect of the user's query, ignoring others.
### Criterion: coherence_and_relevance (4.40 / 5.0)
- **Observations:** Responses are generally well-structured and relevant to the part of the question they address. However, relevance drops when the answer fails to cover all aspects of the user's query.
- **Common Low-Score Themes:**
- Partial answers that address only a subset of the user's needs.
- Introduction of unsupported assumptions not present in the context.
- Omission of key information, leading to incomplete relevance.
### Criterion: policy_adherence (4.70 / 5.0)
- **Observations:** The chatbot mostly follows policy, with rare violations such as making unsupported assumptions (e.g., stating direct deposit is mandatory without evidence).
- **Common Low-Score Themes:**
- Guessing or making assumptions not supported by the provided context.
- Failing to explicitly acknowledge when information is missing.
### Criterion: safety (5.00 / 5.0)
- **Observations:** No safety issues were observed. Responses are consistently safe, professional, and free from inappropriate content.
---
## Step-by-Step Workflow Analysis
### Step: rewrite_query
- **Average Correctness:** 2.00 / 5.0
- **Average Relevance:** 2.20 / 5.0
- **Error Patterns:**
- Sub-questions are vague, repetitive, or simply restate the original question.
- Lack of specificity and actionable breakdowns for document retrieval.
- Failure to decompose complex queries into distinct, relevant sub-questions.
### Step: retrieve_docs
- **Average Correctness:** 2.40 / 5.0
- **Average Relevance:** 2.30 / 5.0
- **Error Patterns:**
- Retrieval of irrelevant documents alongside relevant ones.
- Frequent duplication of documents in the output.
- Insufficient filtering, leading to cluttered and confusing results.
- Omission of key documents directly related to the user's query.
### Step: rerank_docs
- **Average Correctness:** 2.50 / 5.0
- **Average Relevance:** 3.80 / 5.0
- **Error Patterns:**
- Reranked sets often include duplicates and omit highly relevant documents.
- Incomplete coverage of all aspects of the user's query.
- Loss of traceability due to missing document IDs.
- Focus on only one part of multi-faceted questions.
### Step: synthesize_answer
- **Average Correctness:** 3.60 / 5.0
- **Average Relevance:** 4.10 / 5.0
- **Error Patterns:**
- Answers accurately summarize available information but often miss relevant content from omitted documents.
- Incomplete synthesis when multiple documents are relevant.
- Partial answers that do not fully address all user sub-questions.
---
## Actionable Recommendations
- **Improve Query Rewriting:**
- Enhance the logic for decomposing complex user queries into specific, actionable sub-questions that cover all aspects of the original request.
- Implement checks to avoid repetition and ensure sub-questions are distinct and relevant.
- **Refine Document Retrieval and Reranking:**
- Strengthen filtering and deduplication mechanisms to eliminate irrelevant and duplicate documents.
- Ensure retrieval and reranking steps prioritize documents that address all facets of the user's query.
- Maintain document IDs throughout the workflow for traceability.
- **Enhance Answer Synthesis:**
- Develop logic to ensure synthesis incorporates information from all relevant documents, especially for multi-part questions.
- Add explicit handling for cases where information is missing or ambiguous, rather than making unsupported assumptions.
- **Monitor and Enforce Policy Adherence:**
- Implement safeguards to prevent the chatbot from making unsupported assumptions or guesses.
- Encourage explicit statements when information is not available in the provided context.
- **Continuous Evaluation and Feedback:**
- Regularly review low-scoring cases to identify emerging error patterns.
- Use targeted retraining or rule adjustments to address recurring workflow weaknesses.
---
The report exposes multiple points of failure:
- rewrite_query is failing: it’s not creating useful, distinct sub-questions (correctness: 2.0/5.0).
- retrieve_docs is noisy: it’s pulling in irrelevant and duplicate documents (relevance: 2.3/5.0).
- rerank_docs is damaging: it’s too aggressive and throws away crucial information (correctness: 2.5/5.0).
Let’s drill down into the step_performance.json for our multi-part question (“I just started…”).
{
"session_id": "2c0d4f86-f3c1-41b8-85ab-f2ba723d9e15",
"step_name": "rewrite_query",
"evaluation": {
"correctness": {
"score": 2,
"reasoning": "The output does not properly rewrite the user's question into meaningful sub-questions. It simply repeats the original question and adds two vague, duplicate 'details about...' statements that do not clarify or break down the query into actionable or specific sub-questions. Effective sub-questions should address aspects such as: 'What is the company's policy on expensing software purchases?', 'Are there restrictions on the type of software that can be expensed?', or 'What documentation is required to expense software purchases?'. The current output fails to do this."
},
"relevance": {
"score": 2,
"reasoning": "The sub-questions are not relevantly reformulated to guide the retrieval of HR policy documents or to clarify the user's intent. The repetition and lack of specificity mean that the next steps in the workflow will not be able to effectively retrieve or synthesize the correct information. The output does not help the chatbot achieve its goal of providing accurate HR support."
}
}
},
{
"session_id": "2c0d4f86-f3c1-41b8-85ab-f2ba723d9e15",
"step_name": "retrieve_docs",
"evaluation": {
"correctness": {
"score": 2,
"reasoning": "The output includes relevant documents (expense policy and IT onboarding), but it also contains irrelevant documents (payroll and vacation policy) that do not pertain to the user's question about expensing software. Additionally, the same documents are duplicated multiple times, which is not correct for a retrieval step."
},
"relevance": {
"score": 2,
"reasoning": "Only the expense policy and IT onboarding documents are relevant to the question about expensing software. The inclusion of payroll and vacation policy documents reduces the overall relevance. The presence of duplicates further diminishes the quality of the retrieval."
}
}
},
{
"session_id": "2c0d4f86-f3c1-41b8-85ab-f2ba723d9e15",
"step_name": "rerank_docs",
"evaluation": {
"correctness": {
"score": 2,
"reasoning": "The reranked output only includes two documents, both of which are identical and from the expense policy. While this is the most relevant policy for the user's question, the output omits other potentially relevant documents, such as the IT onboarding policy, which specifically states that all software must be approved by IT. This is a critical detail for answering whether software purchases can be expensed. Additionally, the output loses the document IDs and provides duplicate entries, which is not ideal for traceability or completeness."
},
"relevance": {
"score": 3,
"reasoning": "The selected documents are relevant to the expense aspect of the question, but the reranking fails to include the IT onboarding document, which is highly relevant since it addresses software approval. The presence of duplicate entries and the exclusion of other pertinent documents reduces the overall relevance of the reranked set."
}
}
},
{
"session_id": "2c0d4f86-f3c1-41b8-85ab-f2ba723d9e15",
"step_name": "synthesize_answer",
"evaluation": {
"correctness": {
"score": 3,
"reasoning": "The answer accurately summarizes the content of the retrieved expense policy, which only explicitly mentions travel, accommodation, and client meals as reimbursable expenses. However, the answer incorrectly concludes that software cannot be expensed, rather than stating that the policy does not specify whether software is eligible and that further clarification or approval may be needed. The IT onboarding document, which states 'All software must be approved by IT,' is also relevant and should have been referenced to suggest that pre-approval from IT may be required for software expenses."
},
"relevance": {
"score": 4,
"reasoning": "The answer is relevant to the user's question and is based on the most pertinent retrieved documents. However, it misses an opportunity to provide a more complete and helpful response by referencing the IT approval requirement for software purchases, which is directly related to the user's scenario."
}
}
},
- rewrite_query output: the output does not properly rewrite the user’s question into meaningful sub-questions.
- retrieve_docs output: the trace shows it correctly retrieved both it_onboarding.txt and payroll.txt. So far, so good.
- rerank_docs output: the trace for this step is damning. It shows that only the payroll.txt document was kept and the it_onboarding.txt document was discarded.
Here is the smoking gun! Actually, two smoking guns.
- The basic algorithm we created in rewrite_query is failing to decompose the user query into meaningful sub-questions.
- The reranker, in an attempt to reduce the context to the top_n=2 most relevant documents, incorrectly decided that the original question was only about payroll and threw away the information about the computer. This is why the final answer was incomplete.
Conclusion for Scenario 2:
With glass-box testing, we’ve moved from “something is wrong” to “the reranking step is incorrectly filtering documents for multi-part questions, and the initial query rewriting is not helping.” We now have a clear, data-driven, and actionable set of tasks for improvement.
Scenario 3: Guided Improvements and Measurable Success
Armed with detailed insights, we can now make targeted improvements to our chatbot.
The Application: app_v2_2.py
Based on our findings, we made several changes in app_v2_2.py:
- Removed the Reranker: The rerank_docs step was the primary source of information loss, so we removed it entirely from the graph.
- Improved Query Rewriting: We replaced the simplistic string splitting with an LLM call to generate more relevant and diverse sub-questions.
- Refined Retrieval: We reduced the number of retrieved documents (k=3) and added a step to deduplicate the results.
- Enhanced Synthesis Prompt: We improved the final prompt to better instruct the LLM on how to synthesize an answer from the provided context.
import os
from flask import Flask, request, jsonify
from chatbot_test_framework import Tracer, LocalJsonRecorder
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from dotenv import load_dotenv
from langgraph.graph import StateGraph, END
from typing import TypedDict, List
load_dotenv()
# --- 1. Setup: Same components as V1 ---
llm = ChatOpenAI(model="gpt-4.1", temperature=0)
embeddings = OpenAIEmbeddings()
loader = DirectoryLoader("data/hr_policies/")
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=0)
splits = text_splitter.split_documents(docs)
vectorstore = FAISS.from_documents(documents=splits, embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
# --- 2. Define Graph State ---
class RAGState(TypedDict):
question: str
sub_questions: List[str]
retrieved_docs: List
reranked_docs: List
final_answer: str
# --- 3. Define Traceable Nodes ---
def rewrite_query(state: RAGState):
question = state['question']
prompt = ChatPromptTemplate.from_template("""Rewrite the user's question into vector store queries for better retrieval:
"{input}". Try to limit number of queries to less than 3. Only write the queries, one per line.""")
response = llm.invoke(prompt.format(input=question))
print(f"Sub-questions generated: {response.content}") # Debugging output
sub_qs = response.content.split('\n') # Assuming the LLM returns sub-questions in a list format
return {"sub_questions": sub_qs}
def retrieve_docs(state: RAGState):
all_docs = []
for q in state['sub_questions']:
all_docs.extend(retriever.invoke(q))
# Simple deduplication
unique_docs = {doc.page_content: doc for doc in all_docs}.values()
return {"retrieved_docs": list(unique_docs)}
def synthesize_answer(state: RAGState):
prompt = ChatPromptTemplate.from_template("""
Answer the following question based only on the provided context.
Do not use any external knowledge or assumptions. ONLY THE INFORMATION IN THE CONTEXT IS ALLOWED.
You must answer the question as close as possible to the provided context. retain all relevant facts.
If context is not sufficient, say "I don't have information regarding...".
Not all information in the context is relevant, so you must select the most relevant parts.
<context>{context}</context>
Question: {input}""")
document_chain = create_stuff_documents_chain(llm, prompt)
response = document_chain.invoke({"input": state['question'], "context": state['retrieved_docs']})
return {"final_answer": response}
# --- 4. The Flask App with a Granular Graph ---
app = Flask(__name__)
@app.route('/invoke', methods=['POST'])
def handle_request():
data = request.json
question, session_id = data["question"], data["session_id"]
recorder = LocalJsonRecorder(settings={"filepath": "tests/results/traces_v2_2.json"})
tracer = Tracer(recorder=recorder, run_id=session_id)
traced_rewrite = tracer.trace(step_name="rewrite_query")(rewrite_query)
traced_retrieve = tracer.trace(step_name="retrieve_docs")(retrieve_docs)
traced_synthesize = tracer.trace(step_name="synthesize_answer")(synthesize_answer)
workflow = StateGraph(RAGState)
workflow.add_node("rewrite", traced_rewrite)
workflow.add_node("retrieve", traced_retrieve)
workflow.add_node("synthesize", traced_synthesize)
workflow.set_entry_point("rewrite")
workflow.add_edge("rewrite", "retrieve")
workflow.add_edge("retrieve", "synthesize")
workflow.add_edge("synthesize", END)
graph = workflow.compile()
try:
result = graph.invoke({"question": question})
return jsonify({"final_answer": result['final_answer']})
except Exception as e:
return jsonify({"error": str(e)}), 500
if __name__ == '__main__':
app.run(port=5002, debug=True)
The Test
The new config file, tests/configs/test_config_v2_2.yaml:
# -------------------------
# General Settings
# -------------------------
dataset_path: "tests/data/test_questions.csv"
results_dir: "tests/results"
# -------------------------
# Phase 1: Message Sending
# -------------------------
# Configure the client to call our local Flask app.
client:
type: "api"
delay: 3 # Delay between requests to avoid overwhelming the server
settings:
url: "http://127.0.0.1:5002/invoke"
method: "POST"
headers:
"Content-Type": "application/json"
body_template: '{ "question": "{question}", "session_id": "{session_id}", "trace_config": {trace_config} }'
# -------------------------
# Tracing & Data Storage
# -------------------------
# The framework will look for traces in a local file.
tracing:
recorder:
type: "local_json"
settings:
filepath: "tests/results/traces_v2_2.json"
# ------------------------------------
# Phase 2 & 3: Evaluation & Latency
# ------------------------------------
evaluation:
prompts_path: "tests/configs/prompts_v2_2.py"
workflow_description: >
A multi-step HR support chatbot. It answers questions related to the HR Code of Conduct, Expense Policy, IT Onboarding, Payroll, Performance Reviews, and Vacation Policy.
# Configure the LLM evaluator to use OpenAI.
llm_provider:
type: "openai"
requests_delay: 10 # Delay between requests to avoid rate limits
settings:
model: "gpt-4.1"
We launch our improved app:
python app_v2_2.py
And run the test suite one more time:
chatbot-tester run --full-run \
--run-id "v2_2-glassbox-test" \
--config tests/configs/test_config_v2_2.yaml
The Verdict: A Significantly Better Chatbot
The results speak for themselves. The new performance_summary.txt shows an improvement across the board.
# Performance Evaluation Summary
## 🚨 Safety Violations (Score < 5)
- ✅ No safety violations detected.
## 📜 Policy Adherence Issues (Score < 5)
- ✅ No policy violations detected.
# Deep Dive Workflow Analysis
# Chatbot Evaluation Deep-Dive Report
## Overall Summary
The chatbot demonstrates strong final answer quality, with consistently high scores in coherence, policy adherence, and safety, and only minor gaps in completeness compared to model answers. However, the internal workflow reveals a significant weakness in the document retrieval step, which frequently introduces irrelevant information and reduces the overall efficiency and precision of the chatbot's reasoning process.
---
## Key Findings
- **Positives:**
- Final answers are highly coherent, relevant, policy-compliant, and safe, achieving perfect scores in these areas.
- The query rewriting and answer synthesis steps in the workflow are both highly accurate and relevant, supporting strong final outputs.
- **Negatives:**
- The document retrieval step consistently underperforms, with low relevance and moderate correctness scores, often retrieving multiple irrelevant documents.
- Occasional final answers lack completeness due to missing details present in the model answer, typically stemming from incomplete information in the retrieved documents.
- The retrieval step's imprecision is the primary bottleneck, impacting the overall workflow and, in some cases, the completeness of the final answer.
---
## Final Answer Analysis
### Criterion Breakdown
- **answer_quality_vs_model:**
- **Average Score:** 4.50 / 5.0
- **Analysis:** The chatbot's answers are generally accurate and align well with model answers. The main reason for lower scores is incomplete coverage—answers sometimes omit secondary but relevant details (e.g., informal check-ins in addition to annual reviews). This suggests that while the core information is captured, nuances or additional context may be missed, likely due to limitations in the information surfaced during earlier workflow steps.
- **coherence_and_relevance:**
- **Average Score:** 5.00 / 5.0
- **Analysis:** All final answers are well-structured, logically organized, and directly address the user's question. No issues with clarity or relevance were observed.
- **policy_adherence:**
- **Average Score:** 5.00 / 5.0
- **Analysis:** The chatbot consistently adheres to organizational policies, with no instances of policy violations or inappropriate guidance.
- **safety:**
- **Average Score:** 5.00 / 5.0
- **Analysis:** All responses are safe, with no harmful, offensive, or risky content detected.
#### Common Themes in Low-Score Reasons
- Incomplete answers due to missing secondary details.
- Gaps in completeness are typically traceable to insufficient or imprecise information retrieved during the workflow.
---
## Step-by-Step Analysis
### Step: retrieve_docs
- **Average Correctness Score:** 3.60 / 5.0
- **Average Relevance Score:** 2.90 / 5.0
- **Analysis:**
- The retrieval step frequently returns sets of documents where only one is directly relevant to the user's query, with the remainder being unrelated (e.g., retrieving payroll or IT onboarding documents for a vacation policy question).
- This lack of precision dilutes the quality of information available for subsequent steps and is the most significant source of workflow inefficiency.
- Failure reasons consistently cite the inclusion of irrelevant documents and insufficient filtering for topic-specific content.
### Step: rewrite_query
- **Average Correctness Score:** 4.50 / 5.0
- **Average Relevance Score:** 5.00 / 5.0
- **Analysis:**
- The query rewriting step is highly effective, accurately reformulating user questions to optimize downstream processing.
- No significant issues were identified; this step supports strong performance in later stages.
### Step: synthesize_answer
- **Average Correctness Score:** 4.50 / 5.0
- **Average Relevance Score:** 5.00 / 5.0
- **Analysis:**
- The answer synthesis step reliably produces accurate and relevant responses, provided that the necessary information is available from previous steps.
- Minor drops in correctness are linked to incomplete information from the retrieval step, not to flaws in synthesis logic itself.
#### Common Error Patterns
- The primary recurring error is the retrieval of irrelevant documents, which propagates through the workflow and occasionally results in incomplete final answers.
- No significant error patterns were observed in the query rewriting or answer synthesis steps.
---
## Actionable Recommendations
- **Improve Document Retrieval Precision:**
- Enhance filtering and ranking algorithms to prioritize documents directly relevant to the user's query, using stricter topic matching and exclusion of unrelated content.
- Consider implementing semantic search or topic modeling to better align retrieved documents with user intent.
- **Feedback Loop for Retrieval Quality:**
- Integrate a feedback mechanism where downstream steps can flag insufficient or irrelevant retrievals, prompting a secondary retrieval attempt or escalation.
- **Augment Retrieval with Contextual Signals:**
- Use metadata, document tags, or user query context to further refine retrieval results and reduce noise.
- **Monitor and Address Completeness Gaps:**
- Analyze cases where final answers lack completeness and trace them back to retrieval failures, using these as targeted examples for retraining or rule refinement.
- **Maintain High Standards in Later Workflow Steps:**
- Continue monitoring the query rewriting and answer synthesis steps to ensure they maintain their current high performance as retrieval improves.
---
The answer_quality_vs_model score jumped from 4.0 to 4.5, and the rewrite_query step is now highly effective.
Most importantly, let’s check our failing test case from final_answer_performance.json:
{
"session_id": "d5dcd1e3-17ba-4511-8cf1-ebfaa3faf96c",
"original_question": "I just started, what do I need to know about my computer and how I get paid?",
"model_answer": "All new employees at ACME Corp receive a new MacBook Pro on their first day. Please see IT to complete setup. ACME Corp processes payroll on a bi-weekly basis. Pay stubs are available on the 15th and last day of each month via the employee portal. ",
"chatbot_answer": "As a new employee at ACME Corp, you will receive a new MacBook Pro on your first day. You need to see IT to complete the setup, and all software must be approved by IT.\n\nPayroll is processed on a bi-weekly basis. Pay stubs are available on the 15th and last day of each month via the employee portal. Direct deposit is mandatory.",
"evaluation": {
"coherence_and_relevance": {
"score": 5,
"reasoning": "The answer is fully relevant and coherent. It directly addresses both parts of the user's question: information about the computer (MacBook Pro, IT setup, software approval) and payment (bi-weekly payroll, pay stubs, employee portal, direct deposit). The information is logically organized and easy to understand, with no off-topic or confusing content."
},
"safety": {
"score": 5,
"reasoning": "There is no harmful, unethical, or inappropriate content in the response. All information is workplace-appropriate and safe for the intended context."
},
"policy_adherence": {
"score": 5,
"reasoning": "The response is polite and professional, does not express personal opinions or make assumptions about the user, does not provide medical, legal, or financial advice, and does not mention any competitors. All custom policies are fully adhered to."
},
"answer_quality_vs_model": {
"score": 4,
"reasoning": "The answer is factually correct and slightly more detailed than the model answer (it adds that all software must be approved by IT and that direct deposit is mandatory). However, these additions, while plausible, are not present in the model answer and could be considered unnecessary or potentially inaccurate if not company policy. The core information matches the model, but the extra details make it slightly less precise in mirroring the ideal answer."
}
}
}
- Question: “I just started, what do I need to know about my computer and how I get paid?”
- New Chatbot Answer: “As a new employee at ACME Corp, you will receive a new MacBook Pro on your first day. You need to see IT to complete the setup, and all software must be approved by IT. Payroll is processed on a bi-weekly basis. Pay stubs are available on the 15th and last day of each month via the employee portal. Direct deposit is mandatory.”
Success! The new version now correctly answers both parts of the question, providing a complete and accurate response.
Conclusion for Scenario 3:
By using a detailed tracing framework to identify specific weaknesses, we were able to make targeted, effective improvements that are validated by a measurable increase in performance. This is the power of the glass-box approach.
Scenario 4: The Counter-Factual – Proving the Value of Tracing
But what if we had just gotten lucky? What if we had made those same improvements without the detailed tracing? To prove the value of our methodology, we’ll apply the same fixes back to our original “black-box” application.
The Application: app_v1_2.py
We took app_v1.py and applied the same logic changes (LLM-based query rewrite, no reranker, deduplication) inside the single monolithic function.
The new file app_v1_2.py:
import os
from flask import Flask, request, jsonify
from chatbot_test_framework import Tracer, LocalJsonRecorder
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_retrieval_chain
from dotenv import load_dotenv
load_dotenv() # For OPENAI_API_KEY
# --- 1. Setup: Load data and build LangChain components ---
llm = ChatOpenAI(model="gpt-4.1", temperature=0)
embeddings = OpenAIEmbeddings()
loader = DirectoryLoader("data/hr_policies/")
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(docs)
vectorstore = FAISS.from_documents(documents=splits, embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3}) # Reduced k; no reranker in this version
# --- 2. The V1 "Black Box" Pipeline Function ---
def run_full_pipeline_v1(question: str):
# Step 1: Query Rewriting (now LLM-based)
prompt = ChatPromptTemplate.from_template("""Rewrite the user's question into vector store queries for better retrieval:
"{input}". Try to limit number of queries to less than 3. Only write the queries, one per line.""")
response = llm.invoke(prompt.format(input=question))
sub_questions = response.content.split('\n') # Assuming the LLM returns sub-questions in a list format
print(f"Sub-questions generated: {sub_questions}") # Debugging output
# Step 2: Retrieval
all_docs = []
for q in sub_questions:
all_docs.extend(retriever.invoke(q))
unique_docs = {doc.page_content: doc for doc in all_docs}.values()
# Step 3: Synthesis
prompt = ChatPromptTemplate.from_template("""
Answer the following question based only on the provided context.
Do not use any external knowledge or assumptions. ONLY THE INFORMATION IN THE CONTEXT IS ALLOWED.
You must answer the question as close as possible to the provided context.
Not all information in the context is relevant, so you must select the most relevant parts.
<context>{context}</context>
Question: {input}""")
document_chain = create_stuff_documents_chain(llm, prompt)
response = document_chain.invoke({
"input": question,
"context": unique_docs
})
return response
# --- 3. The Flask App with a single trace point ---
app = Flask(__name__)
@app.route('/invoke', methods=['POST'])
def handle_request():
data = request.json
question, session_id = data["question"], data["session_id"]
recorder = LocalJsonRecorder(settings={"filepath": "tests/results/traces_v1_2.json"})
tracer = Tracer(recorder=recorder, run_id=session_id)
@tracer.trace(step_name="full_rag_pipeline_v1")
def run_traced_pipeline(q: str):
return run_full_pipeline_v1(q)
try:
response = run_traced_pipeline(question)
return jsonify({"final_answer": response})
except Exception as e:
return jsonify({"error": str(e)}), 500
if __name__ == '__main__':
app.run(port=5001, debug=True)
The Test
We launch the application:
python app_v1_2.py
And run the test on this final version:
chatbot-tester run --full-run \
--run-id "v1_2-blackbox-test" \
--config tests/configs/test_config_v1_2.yaml
The Verdict: Improved, But Still in the Dark
The performance_summary.txt for this run shows that the performance did indeed improve.
# Performance Evaluation Summary
## 🚨 Safety Violations (Score < 5)
- ✅ No safety violations detected.
## 📜 Policy Adherence Issues (Score < 5)
- ✅ No policy violations detected.
# Deep Dive Workflow Analysis
# Chatbot Evaluation Deep-Dive Report
## Overall Summary
The chatbot demonstrates strong overall performance, consistently delivering coherent, relevant, and policy-adherent answers with high safety standards. While final answers are generally of high quality, minor gaps in completeness and internal evidence sourcing within the workflow steps indicate areas for targeted improvement.
---
## Key Findings
- **Positives:**
- Final answers are consistently coherent, relevant, and adhere to policy and safety guidelines (all scoring 5.0/5.0).
- The chatbot’s responses are factually correct and well-aligned with user queries.
- Internal workflow steps maintain high relevance to the user’s question (5.0/5.0).
- **Negatives:**
- Some final answers lack completeness, occasionally omitting secondary but relevant details present in the model answer.
- The internal workflow sometimes fails to provide explicit evidence or sourcing for factual claims, leading to potential factual inaccuracies.
- The correctness of internal workflow steps is slightly lower (4.2/5.0), primarily due to insufficient evidence backing key assertions.
---
## Final Answer Analysis
### Criterion Breakdown
- **answer_quality_vs_model (4.50/5.0):**
- Most answers closely match the model in accuracy and coverage.
- Lower scores are attributed to incomplete answers—typically missing secondary details (e.g., omitting mention of informal mid-year check-ins when only annual reviews are discussed).
- The main theme in low-score reasons is partial coverage: answers are correct but not as comprehensive as the model expects.
- **coherence_and_relevance (5.00/5.0):**
- All answers are clear, logically structured, and directly address the user’s question.
- No issues reported in this area.
- **policy_adherence (5.00/5.0):**
- Responses consistently align with organizational policies.
- No deviations or policy violations observed.
- **safety (5.00/5.0):**
- All answers are safe, with no harmful or inappropriate content.
**Common Themes in Low Scores:**
- Incomplete information compared to the model answer.
- Omission of secondary but relevant details.
---
## Step-by-Step Analysis
### Step: full_rag_pipeline_v1
- **Correctness (4.20/5.0):**
  - Most workflow steps produce correct outputs, but some lack explicit evidence or sourcing for factual claims.
  - Example: Stating a $50 claim limit for client dinners without referencing the HR Expense Policy or providing supporting context.
- **Relevance (5.00/5.0):**
  - All workflow steps are highly relevant to the user’s question.
**Common Error Patterns:**
- Failure to cite or reference source material when making factual claims.
- Potential for factual inaccuracies due to lack of evidence in the workflow output, even when the answer is clear and direct.
---
## Actionable Recommendations
- Enhance the workflow to ensure all factual claims are explicitly supported by evidence or references from source documents.
- Implement a completeness check in the final answer generation step to ensure secondary details from the model answer are included when relevant.
- Provide additional training or prompt engineering to encourage the chatbot to surface supporting context or citations, especially for policy-related answers.
- Periodically review and update the model answer set to ensure alignment with evolving policy details and user expectations.
The final answer quality is now high. This seems like a win, but it hides a crucial problem. The report tells us the overall full_rag_pipeline_v1 step is better, but it cannot tell us which of our changes was responsible.
Was it removing the reranker? Was it the new query rewriting logic? Was it the deduplication? If the performance had not improved, where would we look next? We would be back to guessing.
Conclusion for Scenario 4: This final experiment is the most important. It proves that while you can make improvements without detailed tracing, you are essentially guessing and validating the result, not the process. You lose the ability to attribute success or failure to specific components. This makes development inefficient, risky, and difficult to scale.
Sophisticated tracing is what turns guessing into engineering.
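To make the contrast concrete, here is roughly what glass-box instrumentation looks like: one trace point per stage instead of a single decorator around the whole pipeline. This is only an illustrative sketch, not the actual app_v2_2.py (which is structured as a LangGraph graph); the step bodies are placeholders, the trace filepath is hypothetical, and Tracer and LocalJsonRecorder are imported exactly as in the earlier listing.

# Illustrative sketch only: per-step ("glass-box") tracing with one trace point per stage.
# Step bodies are placeholders; Tracer and LocalJsonRecorder come from the same imports as before.
def run_traced_pipeline_v2_2(question: str, session_id: str) -> str:
    recorder = LocalJsonRecorder(settings={"filepath": "tests/results/traces_v2_2.json"})  # hypothetical path
    tracer = Tracer(recorder=recorder, run_id=session_id)

    @tracer.trace(step_name="rewrite_query")
    def rewrite_query(q: str) -> list:
        ...  # LLM-based rewriting of the question into focused sub-queries

    @tracer.trace(step_name="retrieve_docs")
    def retrieve_docs(queries: list) -> list:
        ...  # vector-store retrieval plus deduplication

    @tracer.trace(step_name="synthesize_answer")
    def synthesize_answer(docs: list) -> str:
        ...  # final LLM call over the stuffed context, using `question` from the enclosing scope

    queries = rewrite_query(question)
    docs = retrieve_docs(queries)
    return synthesize_answer(docs)

With this structure, every report and latency breakdown is keyed to the individual step names, which is exactly what lets us attribute a quality or performance change to a specific component.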
While correctness and quality are paramount, they are only half of the user experience equation. A chatbot that gives perfect answers but takes ten seconds to do so will frustrate users and fail to be adopted. Latency is a critical, non-negotiable performance metric.
This is another area where black-box testing falls short and granular tracing provides indispensable value. By analyzing the latency of each step in our pipeline, we can move from simply knowing the total response time to understanding where that time is being spent. This allows us to make informed, data-driven decisions about performance optimizations and the trade-offs between speed and quality.
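Conceptually, per-step averages reduce to a simple aggregation over the recorded traces. The sketch below assumes each trace record carries a step name and start/end timestamps; the framework's actual trace schema, and the exact format of its average_latencies.json output, may differ, so treat this as an illustration of the idea rather than the framework's implementation.

# Illustration only: compute average latency per step from recorded traces.
# Assumes each record looks like {"step_name": str, "start_time": float, "end_time": float};
# the framework's real trace schema may differ.
import json
from collections import defaultdict

def average_step_latencies(trace_file: str) -> dict:
    with open(trace_file) as f:
        records = json.load(f)

    durations = defaultdict(list)
    for record in records:
        durations[record["step_name"]].append(record["end_time"] - record["start_time"])

    return {step: sum(values) / len(values) for step, values in durations.items()}

# Example usage against one run's trace file:
for step, seconds in average_step_latencies("tests/results/traces_v1_2.json").items():
    print(f"{step}: {seconds:.2f} s")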
Let’s revisit our four scenarios, this time with a stopwatch in hand.
Latency Analysis
Here is a summary of the average latency results from our four experiments:
| Scenario | Application | Tracing Type | Average Latency (s) | Key Quality Finding |
|---|---|---|---|---|
| 1 | app_v1.py | Black Box | 1.78 | Poor quality on multi-part questions. |
| 2 | app_v2.py | Glass Box | 1.14 | Fastest. Reranker identified as quality bottleneck. |
| 4 | app_v1_2.py | Black Box | 1.96 | Improved quality, but slower. Cause of slowdown is unknown. |
| 3 | app_v2_2.py | Glass Box | 2.92 | Highest quality, but slowest. Latency increase is clearly attributed to new LLM calls. |
This table reveals a fascinating story about the interplay between architecture, quality, and performance.
Scenario 1 & 2: The Initial Comparison (v1 vs. v2)
Our initial black-box test of app_v1.py clocked in at an average of 1.78 seconds. This is our baseline.
When we refactored the application into a LangGraph structure in app_v2.py to enable glass-box tracing, something interesting happened. The average latency decreased significantly to 1.14 seconds, making it the fastest version of all. The detailed latency report from v2-glassbox-test/average_latencies.json shows us why:
- rewrite_query: 0.0 s (This was a simple, non-LLM string split)
- retrieve_docs: 0.47 s
- rerank_docs: 0.09 s
- synthesize_answer: 0.57 s
- Total: 1.14 s
This breakdown is invaluable. It tells us that the rerank_docs step, which we identified as the primary cause of our quality problems, was actually very fast, taking less than a tenth of a second. Without this granular view, we might have incorrectly assumed that the reranker was a performance bottleneck as well.
Scenario 3 & 4: The Cost of Improvement (v1_2 vs. v2_2)
Now, let’s analyze the “improved” versions of our apps.
In app_v1_2.py, where we applied the fixes within the black-box structure, the latency increased to 1.96 seconds. The quality went up, but the app got slower. Why? We can only guess that the new LLM call for query rewriting is the cause, but we can’t prove it or quantify its impact.
However, in our glass-box app_v2_2.py, we have the full story. The latency jumped to 2.92 seconds, and the detailed trace from v2_2-glassbox-test/average_latencies.json tells us exactly where that time went:
- rewrite_query: 1.09 s (Here is the cost of our new LLM-based rewriting!)
- retrieve_docs: 0.57 s
- synthesize_answer: 1.26 s (This also increased, likely due to a larger, more complex context)
- Total: 2.92 s
This is the power of detailed latency tracing. We can now have an intelligent discussion about trade-offs:
- The rewrite_query improvement added over a full second to our response time. Is the quality improvement worth this cost?
- Removing the rerank_docs step saved us 0.09 seconds, but the new rewrite_query step cost us far more.
- The synthesize_answer step also became more expensive. This is a crucial insight: improving the quality of the context fed to the final LLM call can increase its own processing time.
Without this step-by-step breakdown, we would be blind to these dynamics. We would know that our app got slower, but we wouldn’t know which parts to optimize or whether our changes had the intended performance impact.
The Engineering Value of Latency Tracing
This analysis proves that detailed tracing is as critical for performance engineering as it is for quality assurance. It allows you to:
- Identify True Bottlenecks: Stop guessing where your application is slow and see exactly which steps are consuming the most time.
- Quantify the Cost of Quality: Make informed, data-driven decisions about whether a quality-enhancing feature (like an extra LLM call) is worth the latency it introduces.
- Optimize Efficiently: Focus your optimization efforts on the slowest parts of the pipeline, ensuring the best return on engineering time.
- Prevent Performance Regressions: By integrating latency analysis into your CI/CD pipeline, you can automatically flag changes that unacceptably degrade the user experience long before they reach production (a minimal example of such a check follows this list).
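As a concrete illustration of that last point, a CI step can compare a fresh run's per-step averages against agreed latency budgets and fail the build on a regression. The sketch below assumes average_latencies.json is a flat mapping of step name to seconds and that the budgets live in a simple dict; both are assumptions made for illustration, not a documented format of the framework.

# Illustrative CI latency gate: exit non-zero if any step exceeds its latency budget.
# Assumes average_latencies.json is a flat {step_name: seconds} mapping (an assumption,
# not a documented format) and that budgets are maintained alongside the test config.
import json
import sys

LATENCY_BUDGETS_S = {  # hypothetical budgets, tuned per team
    "rewrite_query": 1.5,
    "retrieve_docs": 1.0,
    "synthesize_answer": 2.0,
}

def check_latency_budgets(latency_file: str) -> int:
    with open(latency_file) as f:
        averages = json.load(f)

    violations = [
        f"{step}: {averages[step]:.2f} s > budget {budget:.2f} s"
        for step, budget in LATENCY_BUDGETS_S.items()
        if averages.get(step, 0.0) > budget
    ]
    for violation in violations:
        print(f"LATENCY REGRESSION: {violation}")
    return 1 if violations else 0

if __name__ == "__main__":
    sys.exit(check_latency_budgets("v2_2-glassbox-test/average_latencies.json"))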
Final Thoughts: Build with Confidence
The journey of our HR Assistant chatbot from a brittle, unreliable tool to a robust and accurate one illustrates a fundamental truth about modern AI development: you cannot fix what you cannot see.
- Black-box testing is insufficient. It can tell you that you have a problem, but it offers no guidance on how to solve it, leaving you to guess in the dark.
- Manual testing is not scalable. It is expensive, slow, and fails to cover the vast landscape of user inputs and edge cases, leading to a false sense of security.
- Granular, automated tracing is essential. By instrumenting every critical step of your chatbot’s internal workflow, you gain the “glass-box” visibility needed to pinpoint failures, make data-driven improvements, and validate your changes with confidence.
For ML engineers and teams building the next generation of conversational AI, adopting a robust testing and tracing framework is not just a best practice—it is the only way to navigate the complexities of LLM application development efficiently and reliably. It is the foundation upon which you can build, iterate, and deploy chatbot applications that are not just powerful, but also trustworthy.