Why Use Evaluations
Quality Monitoring
Track response quality metrics over time to catch degradations early
A/B Testing
Compare different models, prompts, or parameters to optimize performance
Compliance
Ensure outputs meet safety, policy, and regulatory requirements
Continuous Improvement
Use evaluation scores to build better training datasets and refine prompts
Evaluation Methods
Helicone supports multiple approaches to evaluate your LLM outputs:

LLM as a Judge
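One way to implement LLM-as-a-judge is to send the question and answer to a second model with a grading rubric and parse its structured verdict. The prompt, model, and 1-5 scale below are illustrative choices, not Helicone-specific requirements:

```python
# Minimal LLM-as-a-judge sketch. The rubric, score scale, and model
# name are illustrative assumptions.
import json

JUDGE_PROMPT = """You are grading an assistant's answer.
Question: {question}
Answer: {answer}
Return JSON: {{"relevance": <1-5>, "accuracy": <1-5>}}"""

def build_judge_messages(question: str, answer: str) -> list[dict]:
    """Build the chat messages sent to the judge model."""
    return [{"role": "user",
             "content": JUDGE_PROMPT.format(question=question, answer=answer)}]

def parse_judge_reply(reply: str) -> dict:
    """Parse the judge's JSON verdict into numeric scores."""
    return {k: int(v) for k, v in json.loads(reply).items()}

# With a real client (hypothetical wiring):
# reply = client.chat.completions.create(model="gpt-4o-mini",
#                                        messages=build_judge_messages(q, a))
```

Keeping the verdict in strict JSON makes it easy to store the parsed numbers as Helicone scores.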
Use another LLM to evaluate response quality.

Custom Evaluators
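A custom evaluator is simply code you control that turns a request/response pair into named scores. The checks below are illustrative placeholders:

```python
# Custom evaluator sketch: deterministic checks that need no LLM call.
# The specific checks and the 2000-character limit are illustrative.

def evaluate(prompt: str, response: str) -> dict:
    """Return named boolean scores for one request/response pair."""
    return {
        "non_empty": bool(response.strip()),
        "length_ok": len(response) <= 2000,
        "echoes_topic": prompt.lower().split()[0] in response.lower()
        if prompt.split() else False,
    }
```

Because the evaluator is plain code, it can run on any infrastructure you already operate.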
Deploy your own evaluation logic using any infrastructure.

External Services
Integrate third-party evaluation platforms.

Scoring Mechanisms
Score Types
Helicone supports various scoring approaches:

- Numeric Scores
- Boolean Checks
- Categorical
Rate responses on a numerical scale.
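Numeric scores are easiest to aggregate when every evaluator reports on one shared scale. A small normalization helper (the 0-100 target scale is our choice, not a Helicone requirement):

```python
# Normalize a 1..scale_max rating onto 0-100 so mixed evaluators
# (1-5 rubrics, 1-10 rubrics, ...) share one dashboard scale.

def to_percent(score: int, scale_max: int = 5) -> int:
    """Map a 1..scale_max rating onto 0-100."""
    if not 1 <= score <= scale_max:
        raise ValueError(f"score must be in 1..{scale_max}")
    return round((score - 1) * 100 / (scale_max - 1))
```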
Storing Scores
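As a sketch, scores can be attached to a logged request by its request ID. The endpoint path and payload shape below are assumptions based on Helicone's public scoring API; check the current API reference before relying on them:

```python
# Sketch of posting scores for a logged request. The endpoint path
# ("/v1/request/{id}/score") and the {"scores": ...} payload shape are
# assumptions -- verify against Helicone's API reference.
import json
import urllib.request

API_BASE = "https://api.helicone.ai/v1"

def build_score_request(request_id: str, scores: dict,
                        api_key: str) -> urllib.request.Request:
    """Build the POST request that attaches scores to one logged request."""
    return urllib.request.Request(
        f"{API_BASE}/request/{request_id}/score",
        data=json.dumps({"scores": scores}).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )

# To actually send it:
# urllib.request.urlopen(build_score_request(request_id, {"accuracy": 4}, key))
```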
Send evaluation scores to Helicone via the API.

Setting Up Evaluations
Using Webhooks
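A webhook handler receives each completed request, runs an evaluator, and posts scores back. The payload field names (`request_id`, `response_body`) are assumptions; Helicone's actual webhook schema may differ:

```python
# Webhook handler sketch: evaluate a completed request and report scores.
# The payload field names are assumptions about the webhook schema.

def handle_webhook(payload: dict) -> dict:
    """Run cheap checks on one completed request and return its scores."""
    request_id = payload["request_id"]
    response_text = payload["response_body"]
    scores = {
        "non_empty": bool(response_text.strip()),
        "short_enough": len(response_text) <= 4000,  # illustrative limit
    }
    # post_scores(request_id, scores)  # e.g. via the scores API
    return {"request_id": request_id, "scores": scores}
```

In production this function would sit behind an HTTP endpoint and verify the webhook's signature before trusting the payload.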
This is the most common pattern for automated evaluation.

Using Scoring Workers
Deploy dedicated workers for evaluation.

Analytics and Monitoring
Score Dashboard
Track evaluation metrics over time in the Helicone dashboard:

- Score trends - Monitor how quality changes over time
- Score distribution - See the spread of scores across requests
- Model comparison - Compare scores between different models
- Filter by properties - Analyze scores by environment, user, or feature
Alerting on Scores
Combine evaluations with alerts to catch quality issues.

Experiment Tracking
Use scores to compare experiments.

Best Practices
Multiple Evaluators
Use diverse evaluation methods to catch different types of issues
Sampling Strategy
Evaluate a representative sample rather than every request to reduce costs
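One simple way to sample is to hash each request ID into a bucket, so the same request is always consistently in or out of the sample and the rate is stable. The 10% rate is illustrative:

```python
# Deterministic sampling sketch: hash the request ID so sampling
# decisions are stable across retries. The 10% rate is illustrative.
import hashlib

def should_evaluate(request_id: str, rate: float = 0.1) -> bool:
    """Return True for roughly `rate` of all request IDs, deterministically."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate
```

Hash-based sampling also makes it easy to raise the rate later without re-deciding past requests.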
Human-in-the-Loop
Combine automated scores with periodic human review for calibration
Version Control
Track evaluator versions to understand score changes over time
Common Evaluation Patterns
Quality Metrics
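As an illustration of a quality metric, lexical overlap against a reference answer is a cheap baseline; real deployments often use an LLM judge or embedding similarity instead:

```python
# Illustrative quality metric: token overlap between the model's answer
# and a reference answer. A cheap baseline, not a recommended final metric.

def token_overlap(answer: str, reference: str) -> float:
    """Fraction of reference tokens that appear in the answer."""
    answer_tokens = set(answer.lower().split())
    reference_tokens = set(reference.lower().split())
    if not reference_tokens:
        return 0.0
    return len(answer_tokens & reference_tokens) / len(reference_tokens)
```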
Safety Checks
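A safety check can be as simple as scanning responses for blocked patterns; the patterns below are placeholders, not a real policy:

```python
# Illustrative safety check: flag responses containing blocked patterns
# (e.g. leaked credentials). The patterns are placeholders.
import re

BLOCKED_PATTERNS = [re.compile(p, re.IGNORECASE)
                    for p in [r"\bssn\b", r"api[_-]?key"]]

def safety_check(response: str) -> dict:
    """Return whether the response passes, plus any matched patterns."""
    violations = [p.pattern for p in BLOCKED_PATTERNS if p.search(response)]
    return {"safe": not violations, "violations": violations}
```

The boolean `safe` result maps directly onto a boolean Helicone score, while `violations` is useful for debugging.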
Performance Metrics
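Performance metrics summarize latency and token usage across logged requests. A sketch using the standard library (the field names `latency_ms` and `total_tokens` are assumptions about your log shape):

```python
# Illustrative performance summary over logged requests. The field names
# ("latency_ms", "total_tokens") are assumptions about the log shape.
from statistics import mean, quantiles

def perf_summary(requests: list[dict]) -> dict:
    """Compute p50 latency, mean latency, and total token usage."""
    latencies = [r["latency_ms"] for r in requests]
    return {
        "p50_ms": quantiles(latencies, n=100)[49],
        "mean_ms": mean(latencies),
        "total_tokens": sum(r["total_tokens"] for r in requests),
    }
```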
Related Features
Webhooks
Trigger evaluations automatically when requests complete
Datasets
Build evaluation datasets from scored production data
Experiments
Compare evaluation scores across different configurations
Alerts
Get notified when evaluation scores drop below thresholds
Evaluations help you maintain and improve LLM quality over time. Start with simple scoring metrics, then expand to more sophisticated evaluation methods as your application matures.