Step-by-step tutorial for integrating RAGAS evaluation scores with Helicone for RAG observability
Learn how to evaluate Retrieval-Augmented Generation (RAG) applications using RAGAS and report evaluation scores to Helicone for centralized observability.
Create a RAG function that tracks Helicone request IDs:
```python
def rag_query(question: str, contexts: list[str]) -> tuple[str, str]:
    """
    Perform RAG query and return answer + Helicone request ID.

    Args:
        question: User question
        contexts: Retrieved context chunks from vector DB

    Returns:
        Tuple of (answer, helicone_request_id)
    """
    # Format context for prompt
    context_text = "\n\n".join(contexts)

    # Make RAG request
    # (assumes `client` is the OpenAI client already configured to route through Helicone)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Answer the question based only on the provided context."
            },
            {
                "role": "user",
                "content": f"Context:\n{context_text}\n\nQuestion: {question}"
            }
        ],
        extra_headers={
            "Helicone-Property-Feature": "rag-qa",
            "Helicone-Property-Environment": "evaluation",
        }
    )

    answer = response.choices[0].message.content

    # Placeholder: response.id is OpenAI's completion ID, not the Helicone request ID.
    # In practice, read the Helicone request ID from the response headers,
    # for example via the OpenAI client's .with_raw_response accessor.
    helicone_id = response.id

    return answer, helicone_id
```
To get the Helicone request ID from response headers, use the OpenAI Python client’s .with_raw_response accessor, which exposes the HTTP headers alongside the parsed completion, or inspect the response headers directly.
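As a minimal sketch of the header-based approach: it assumes the OpenAI Python SDK (v1 or later) and assumes Helicone returns the request ID in a `helicone-id` response header, so verify the header name against your own setup. The helper names `helicone_id_from_headers` and `create_with_helicone_id` are hypothetical, not part of either library.

```python
from typing import Any, Mapping, Optional


def helicone_id_from_headers(headers: Mapping[str, str]) -> Optional[str]:
    """Return the Helicone request ID from response headers, if present.

    Assumes the ID arrives in a `helicone-id` header. The comparison is
    case-insensitive so plain dicts work as well as httpx-style headers.
    """
    for name, value in headers.items():
        if name.lower() == "helicone-id":
            return value
    return None


def create_with_helicone_id(client: Any, **create_kwargs: Any) -> tuple[Any, Optional[str]]:
    """Call chat.completions and return (parsed completion, Helicone ID).

    `client` is expected to be an OpenAI client routed through Helicone.
    """
    # with_raw_response returns the HTTP response wrapper instead of the model object
    raw = client.chat.completions.with_raw_response.create(**create_kwargs)
    completion = raw.parse()  # the usual ChatCompletion object
    return completion, helicone_id_from_headers(raw.headers)
```

Inside `rag_query`, the call to `client.chat.completions.create(...)` could then be swapped for `create_with_helicone_id(client, ...)`, returning the extracted ID instead of `response.id`.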
```python
# Define test cases
test_cases = [
    {
        "question": "What is the capital of France?",
        "contexts": [
            "France is a country in Western Europe.",
            "Paris is the capital and largest city of France.",
            "The city has a population of over 2 million people."
        ],
        "ground_truth": "Paris"
    },
    {
        "question": "When was the Eiffel Tower built?",
        "contexts": [
            "The Eiffel Tower was constructed from 1887 to 1889.",
            "It was designed by engineer Gustave Eiffel.",
            "The tower is located in Paris, France."
        ],
        "ground_truth": "1887-1889"
    },
    {
        "question": "What is photosynthesis?",
        "contexts": [
            "Photosynthesis is a process used by plants to convert light energy into chemical energy.",
            "It occurs in the chloroplasts of plant cells.",
            "Carbon dioxide and water are converted into glucose and oxygen."
        ],
        "ground_truth": "A process by which plants convert light energy into chemical energy"
    }
]

# Run evaluation
if __name__ == "__main__":
    results = evaluate_rag_pipeline(test_cases)

    # Print summary
    print("\n" + "="*50)
    print("EVALUATION SUMMARY")
    print("="*50)

    avg_scores = {
        "faithfulness": 0,
        "answer_relevancy": 0,
        "context_precision": 0,
        "context_recall": 0
    }

    # Sum each metric across all test cases
    for result in results:
        for metric, score in result["scores"].items():
            avg_scores[metric] += score

    # Convert sums to averages and print them as percentages
    for metric in avg_scores:
        avg_scores[metric] /= len(results)
        print(f"{metric}: {avg_scores[metric]:.2%}")

    print("\nView detailed results in Helicone: https://helicone.ai/requests")
```
```
Evaluating test case 1/3...
Question: What is the capital of France?
Answer: The capital of France is Paris.
Scores: {'faithfulness': 1.0, 'answer_relevancy': 0.98, 'context_precision': 1.0, 'context_recall': 1.0}
Scores reported successfully for request req_abc123
Evaluating test case 2/3...
Question: When was the Eiffel Tower built?
Answer: The Eiffel Tower was constructed between 1887 and 1889.
Scores: {'faithfulness': 1.0, 'answer_relevancy': 1.0, 'context_precision': 1.0, 'context_recall': 1.0}
Scores reported successfully for request req_def456
Evaluating test case 3/3...
Question: What is photosynthesis?
Answer: Photosynthesis is a process used by plants to convert light energy into chemical energy.
Scores: {'faithfulness': 1.0, 'answer_relevancy': 0.95, 'context_precision': 0.92, 'context_recall': 0.88}
Scores reported successfully for request req_ghi789

==================================================
EVALUATION SUMMARY
==================================================
faithfulness: 100.00%
answer_relevancy: 97.67%
context_precision: 97.33%
context_recall: 96.00%

View detailed results in Helicone: https://helicone.ai/requests
```
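The summary loop in the script above assumes every result carries all four metrics with a numeric value. A slightly more defensive variant, sketched here under the assumption that a metric may occasionally be missing or come back as NaN (for example, if a judging LLM call fails), averages only the valid values. `average_scores` is a hypothetical helper, not part of RAGAS or Helicone.

```python
import math
from collections import defaultdict


def average_scores(results: list[dict]) -> dict[str, float]:
    """Average each metric across results, skipping missing or NaN values."""
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for result in results:
        for metric, score in result.get("scores", {}).items():
            if score is None or math.isnan(score):
                continue  # ignore values that cannot be averaged
            totals[metric] += score
            counts[metric] += 1
    # Metrics with no valid values at all are omitted from the output
    return {metric: totals[metric] / counts[metric] for metric in counts}
```

With this helper, the summary section reduces to iterating over `average_scores(results).items()` and printing each value, and an empty result list yields an empty summary rather than a division error.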
Start with a golden dataset: Create 20-50 high-quality test cases with ground truth answers
Run evaluations regularly: Set up automated evaluations to catch regressions early
Track score trends: Monitor how metrics change over time, especially after prompt changes
Investigate outliers: Low-scoring responses often reveal edge cases or data quality issues
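One way to turn the practices above into an automated check is to compare each run's average scores against a stored baseline and flag drops beyond a tolerance. The sketch below is illustrative only: `find_regressions`, the baseline values, and the 0.05 tolerance are assumptions, not part of RAGAS or Helicone.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, float]:
    """Return metrics whose score dropped by more than `tolerance`.

    The returned dict maps metric name to the size of the drop.
    Metrics missing from `current` are treated as a score of 0.
    """
    regressions = {}
    for metric, base_score in baseline.items():
        drop = base_score - current.get(metric, 0.0)
        if drop > tolerance:
            regressions[metric] = round(drop, 4)
    return regressions


# Hypothetical baseline from an earlier run and scores from a new run.
baseline = {"faithfulness": 0.98, "context_recall": 0.95}
current = {"faithfulness": 0.97, "context_recall": 0.80}
print(find_regressions(baseline, current))
```

In a scheduled job or CI step, a non-empty result could fail the run, which makes regressions after a prompt change visible immediately.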
RAGAS requires an LLM to calculate some metrics, which adds cost and latency. Consider evaluating a sample of production traffic rather than every request.
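A simple way to evaluate only a sample is to hash each request ID into a bucket, so the same request always gets the same decision across reruns. This is a sketch under stated assumptions: `should_evaluate` and the 10% default rate are hypothetical choices, not a Helicone or RAGAS feature.

```python
import hashlib


def should_evaluate(request_id: str, sample_rate: float = 0.10) -> bool:
    """Deterministically select roughly `sample_rate` of request IDs."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate
```

Hash-based sampling avoids storing which requests were chosen, and raising or lowering `sample_rate` changes evaluation cost without touching the rest of the pipeline.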