Learn how to quickly identify and fix issues in production LLM applications using Helicone’s debugging features.

What You’ll Learn

  • Set up comprehensive request logging
  • Use filters to isolate problematic requests
  • Debug errors and unexpected outputs
  • Track prompt performance over time
  • Identify and fix latency issues

Prerequisites

  • Helicone API key (available from your Helicone dashboard)
  • An LLM application in production
  • Basic understanding of your application’s architecture

Common Debugging Scenarios

This tutorial covers:
  1. Finding and fixing errors (4XX/5XX)
  2. Debugging unexpected model outputs
  3. Identifying latency bottlenecks
  4. Tracking down cost spikes
  5. Investigating user-reported issues

Step 1: Enable Comprehensive Logging

Add headers to capture debugging context:
import { OpenAI } from "openai";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: "https://oai.helicone.ai/v1",
  defaultHeaders: {
    "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
  },
});

// Add debugging context to every request
function makeRequest(userId: string, feature: string, input: string) {
  return client.chat.completions.create(
    {
      model: "gpt-4o",
      messages: [{ role: "user", content: input }],
    },
    {
      headers: {
        // Essential debugging headers
        "Helicone-User-Id": userId,
        "Helicone-Property-Feature": feature,
        "Helicone-Property-Environment": process.env.NODE_ENV,
        "Helicone-Property-Version": "v2.1.0",
        
        // Optional: Add custom request ID for correlation
        "Helicone-Request-Id": `${feature}-${Date.now()}`,
      },
    }
  );
}
Key Debugging Headers:
  • Helicone-User-Id: Identify which users experience issues
  • Helicone-Property-Feature: Isolate problems to specific features
  • Helicone-Property-Environment: Separate dev/staging/production issues
  • Helicone-Property-Version: Track which code version has problems

Scenario 1: Finding and Fixing Errors

Problem: Users reporting 500 errors

Step 1: Navigate to the Requests Dashboard

Step 2: Filter for Errors

Apply filters:
Status Code: 500 (or 4XX/5XX)
Time Range: Last 24 hours
Environment: production
Step 3: Identify Patterns

Look at the error list:
  • Are errors concentrated on a specific feature?
  • Affecting specific users?
  • Started at a specific time?
Example findings:
23 errors in "document-analysis" feature
All started after 2:30 PM
Error message: "Rate limit exceeded"
Step 4: Inspect Request Details

Click on an error to see:
  • Full request payload
  • Error response
  • Model used
  • Request headers
  • Timestamp
{
  "error": {
    "message": "Rate limit exceeded for gpt-4",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}
Step 5: Fix the Issue

Based on findings:
// Add retry logic with exponential backoff
import { setTimeout } from 'timers/promises';

async function makeRequestWithRetry(
  userId: string,
  feature: string,
  input: string,
  maxRetries = 3
) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await makeRequest(userId, feature, input);
    } catch (error: any) {
      if (error.status === 429 && attempt < maxRetries - 1) {
        const delay = Math.pow(2, attempt) * 1000; // 1s, 2s, 4s
        console.log(`Rate limited, retrying in ${delay}ms...`);
        await setTimeout(delay);
      } else {
        throw error;
      }
    }
  }
}
Step 6: Monitor the Fix

Set up an alert to catch future issues:
  1. Go to Settings → Alerts
  2. Create alert:
    • Metric: Error Rate
    • Threshold: > 5%
    • Time window: 10 minutes
    • Filter: Feature = “document-analysis”
  3. Add Slack/email notification

Scenario 2: Debugging Unexpected Outputs

Problem: Model generating incorrect format

Step 1: Find Problematic Requests

Filter requests:
Feature: data-extraction
Time Range: Last 7 days
Sort by: Recent first
Step 2: Review Request/Response

Click on a request to see:
// Request
{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "system",
      "content": "Extract data as JSON"
    },
    {
      "role": "user",
      "content": "Name: John Doe, Age: 30"
    }
  ]
}

// Response (incorrect)
"The person's name is John Doe and they are 30 years old."

// Expected
{"name": "John Doe", "age": 30}
Step 3: Identify the Issue

The prompt is too vague. The model needs clearer instructions.
Step 4: Test Fix in Dashboard

Use Helicone’s prompt testing feature or test locally:
// Improved prompt
const response = await client.chat.completions.create(
  {
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: `Extract data and respond ONLY with valid JSON.
        
Format: {"name": "string", "age": number}
        
Do not include any explanation or additional text.`
      },
      {
        role: "user",
        content: "Name: John Doe, Age: 30"
      }
    ],
    temperature: 0,  // More deterministic
  },
  {
    headers: {
      "Helicone-Property-Version": "v2.2.0",  // Track new version
    }
  }
);
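Prompt instructions alone can still drift. As a belt-and-braces measure, OpenAI's Chat Completions API also accepts response_format: { type: "json_object" }, which constrains the model to emit syntactically valid JSON (note the API requires the word "JSON" to appear somewhere in your messages). A sketch of the request payload:

```typescript
// Request payload combining prompt instructions with enforced JSON output.
// Pass this as the first argument to client.chat.completions.create(...).
const extractionRequest = {
  model: "gpt-4o",
  messages: [
    {
      role: "system" as const,
      content:
        'Extract data and respond ONLY with valid JSON. Format: {"name": "string", "age": number}',
    },
    { role: "user" as const, content: "Name: John Doe, Age: 30" },
  ],
  // Constrains output to valid JSON; the prompt must still mention JSON.
  response_format: { type: "json_object" as const },
  temperature: 0, // More deterministic
};
```

This doesn't guarantee your exact schema, only well-formed JSON, so keep the explicit format instructions in the system prompt.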
Step 5: Compare Versions

After deploying, compare old vs. new:
Filter 1: Version = v2.1.0
Filter 2: Version = v2.2.0

Compare success rates, costs, latencies

Scenario 3: Identifying Latency Issues

Problem: Slow response times

Step 1: Filter Slow Requests

Latency: > 5000ms (5 seconds)
Time Range: Last 24 hours
Sort by: Latency (descending)
Step 2: Analyze Patterns

Look for:
  • Specific models (GPT-4 vs. GPT-4o-mini)
  • Request size (token count)
  • Features with long prompts
Example findings:
Feature: report-generation
Average latency: 12.3s
Token count: 8,500 tokens (very large)
Model: gpt-4o
Step 3: Optimize

// Before: Including entire document
const largePrompt = `Analyze this document:\n${fullDocument}`;

// After: Summarize or chunk first
const optimizedPrompt = `Analyze this summary:\n${summarizeDocument(fullDocument)}`;
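summarizeDocument above is left to your implementation. If summarization isn't an option, a simple chunking helper keeps each request small — a hypothetical sketch using the rough heuristic of ~4 characters per token:

```typescript
// Split a document into chunks that stay under a rough token budget.
// Assumes ~4 characters per token, a coarse heuristic for English text.
function chunkDocument(text: string, maxTokensPerChunk = 2000): string[] {
  const maxChars = maxTokensPerChunk * 4;
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks;
}

// Each chunk can then be analyzed in a separate request and the results merged.
```

Chunking also makes latency more predictable, since each request has a bounded prompt size.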
Step 4: Set Latency Alert

Create alert:
  • Metric: Latency
  • Threshold: P95 > 5000ms
  • Time window: 1 hour
  • Feature: report-generation

Scenario 4: Investigating Cost Spikes

Problem: Unexpected $500 charge

Step 1: View Cost Dashboard

Go to Helicone Dashboard and check:
  • Daily cost trend (when did spike occur?)
  • Cost by feature
  • Cost by user
Step 2: Filter High-Cost Requests

Date: [Date of spike]
Sort by: Cost (descending)
Findings:
Top request: $12.50 (!)  
User: user-789
Feature: document-analysis
Tokens: 125,000 (prompt) + 8,000 (completion)
Step 3: Investigate the Request

Click on expensive request:
{
  "model": "gpt-4",
  "messages": [
    {
      "role": "user",
      "content": "[Entire 500-page PDF content]..."  // Problem!
    }
  ]
}
The user uploaded a massive document without chunking.
Step 4: Implement Safeguards

// Minimal User shape for illustration
interface User {
  tier: "free" | "paid";
  monthlySpend: number;
}

function validateInput(text: string): void {
  const estimatedTokens = Math.ceil(text.length / 4); // Rough estimate: ~4 chars per token
  const MAX_TOKENS = 50000;

  if (estimatedTokens > MAX_TOKENS) {
    throw new Error(
      `Input too large (~${estimatedTokens} tokens). Maximum: ${MAX_TOKENS}`
    );
  }
}

// Add a monthly spend limit per user
function checkCostLimit(user: User): void {
  if (user.tier === "free" && user.monthlySpend > 10) {
    throw new Error("Monthly limit reached. Upgrade to continue.");
  }
}
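To sanity-check a charge like the $12.50 request above, a rough per-request cost estimator helps. The per-million-token prices below are illustrative assumptions only — check your provider's current pricing:

```typescript
// Per-million-token prices in USD. These numbers are illustrative, not real pricing.
interface ModelPricing {
  promptPerMillion: number;
  completionPerMillion: number;
}

// Estimate the cost of a single request from its token counts.
function estimateCostUSD(
  promptTokens: number,
  completionTokens: number,
  pricing: ModelPricing
): number {
  return (
    (promptTokens / 1_000_000) * pricing.promptPerMillion +
    (completionTokens / 1_000_000) * pricing.completionPerMillion
  );
}

// Example: 125,000 prompt + 8,000 completion tokens at assumed prices.
const cost = estimateCostUSD(125_000, 8_000, {
  promptPerMillion: 30,
  completionPerMillion: 60,
});
```

Running this estimate before dispatching a request lets you reject or downgrade calls that would exceed a per-request budget.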

Scenario 5: User-Reported Issue

Problem: “User user-456 says chatbot gave wrong answer yesterday”

Step 1: Find User's Requests

Filter by:
- User ID: user-456
- Date: [Yesterday]
- Feature: chatbot
Step 2: Review Conversation

If using sessions:
Go to: Sessions
Filter: User ID = user-456
Find relevant session by timestamp
View entire conversation flow to understand context.
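Sessions are enabled per-request via headers. A hedged sketch of a small helper that builds them — Helicone-Session-Id, Helicone-Session-Path, and Helicone-Session-Name are the header names in Helicone's sessions docs at the time of writing; verify against the current docs:

```typescript
// Build session headers so related requests group into one conversation view.
function buildSessionHeaders(
  sessionId: string,
  path: string,
  name?: string
): Record<string, string> {
  const headers: Record<string, string> = {
    "Helicone-Session-Id": sessionId, // Groups requests into one session
    "Helicone-Session-Path": path,    // e.g. "/chat" or "/chat/follow-up"
  };
  if (name) headers["Helicone-Session-Name"] = name; // Human-readable label
  return headers;
}

// Usage: spread into the per-request headers alongside Helicone-User-Id, e.g.
// headers: { "Helicone-User-Id": "user-456", ...buildSessionHeaders(id, "/chat") }
```

With these headers in place, the Sessions view shows the user's whole conversation instead of isolated requests.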
Step 3: Identify the Issue

Review the specific request/response:
  • Was context missing?
  • Did model hallucinate?
  • Was there a misunderstanding?
Share findings with user:
Found the issue: The chatbot didn't have access to the
latest product pricing, which was updated yesterday morning.
We're adding a knowledge base refresh to fix this.

Advanced: Custom Request IDs

Correlate Helicone logs with your application logs:
import { randomUUID } from "node:crypto";

const appRequestId = randomUUID(); // Your app's correlation ID

// Log in your application
logger.info("Starting LLM request", { requestId: appRequestId });

// Use same ID in Helicone
await client.chat.completions.create(
  { /* ... */ },
  {
    headers: {
      "Helicone-Request-Id": appRequestId,
    }
  }
);

// Later, search Helicone by your ID
// URL: https://helicone.ai/requests?requestId=your-app-id-123

Best Practices

Add context headers: Include user ID, feature, environment, and version in every request
Use sessions for multi-step flows: Group related requests to see full context
Set up alerts early: Don’t wait for users to report issues
Compare before/after: Use version tags to measure impact of changes
Protect sensitive data: Remove sensitive information from prompts before logging, and keep API keys in environment variables or a secure vault

Debugging Checklist

When investigating an issue:
  • Filter by relevant properties (user, feature, environment)
  • Check error rates and status codes
  • Review request/response payloads
  • Look for patterns (time-based, user-based, feature-based)
  • Check related requests (sessions)
  • Compare with working requests
  • Test fix with version tracking
  • Set up alert to catch recurrence

Next Steps

Alerts

Set up proactive monitoring for errors and anomalies

Sessions

Track multi-step workflows for better context

Custom Properties

Add metadata for powerful filtering and debugging

Webhooks

Get notified immediately when issues occur