Overview

The chat completions endpoint provides OpenAI-compatible chat completion functionality with unified access to multiple LLM providers through the Helicone AI Gateway.

Authentication

All requests to the AI Gateway require authentication using your Helicone API key in the Authorization header:
Authorization: Bearer YOUR_HELICONE_API_KEY

Endpoint

curl https://ai-gateway.helicone.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_HELICONE_API_KEY" \
  -d '{
    "model": "gpt-4",
    "messages": [
      {
        "role": "user",
        "content": "Hello, how are you?"
      }
    ]
  }'
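The same request can be made from Python using only the standard library. A minimal sketch (the `chat_completion` helper name and the `HELICONE_API_KEY` environment variable are illustrative, not part of the API):

```python
import json
import os
import urllib.request

GATEWAY_URL = "https://ai-gateway.helicone.ai/v1/chat/completions"

def chat_completion(payload: dict, api_key: str) -> dict:
    """POST a chat completion request to the AI Gateway and return the parsed JSON body."""
    req = urllib.request.Request(
        GATEWAY_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
}
# chat_completion(payload, os.environ["HELICONE_API_KEY"])
```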

Request Parameters

model
string
required
The model identifier to use for the completion (e.g., gpt-4, claude-3-opus-20240229)
messages
array
required
Array of message objects representing the conversation history. Each message must have a role and content. Supported roles:
  • system - System instructions
  • user - User messages
  • assistant - Assistant responses
  • tool - Tool/function call results
  • function - Legacy function call results
  • developer - Developer-level instructions
temperature
number
Sampling temperature between 0 and 2. Higher values make output more random. Default varies by model.
max_tokens
integer
Maximum number of tokens to generate in the completion.
max_completion_tokens
integer
Maximum number of completion tokens to generate (alternative to max_tokens).
top_p
number
Nucleus sampling parameter. Alternative to temperature. Value between 0 and 1.
top_k
number
Top-K sampling parameter for limiting token selection.
stream
boolean
default:"false"
Whether to stream the response as Server-Sent Events (SSE).
stream_options
object
Options for streaming responses.
stop
string | array
Up to 4 sequences where the API will stop generating further tokens.
n
integer
default:"1"
Number of chat completion choices to generate (1-128).
presence_penalty
number
default:"0"
Penalize new tokens based on whether they appear in the text so far (-2.0 to 2.0).
frequency_penalty
number
default:"0"
Penalize new tokens based on their frequency in the text so far (-2.0 to 2.0).
logit_bias
object
Modify the likelihood of specified tokens appearing in the completion.
logprobs
boolean
default:"false"
Whether to return log probabilities of output tokens.
top_logprobs
integer
Number of most likely tokens to return at each position (0-20). Requires logprobs: true.
response_format
object
Format for the model’s output.
tools
array
List of tools the model can call. Use this for function calling.
tool_choice
string | object
Controls which tool the model uses. Options: none, auto, required, or an object specifying a particular tool.
parallel_tool_calls
boolean
default:"true"
Whether to enable parallel function calling.
user
string
Unique identifier for the end-user, for monitoring and abuse detection.
seed
integer
Random seed for deterministic sampling.
service_tier
string
Service tier to use. Options: auto, default, flex, scale, priority.
reasoning_effort
string
Amount of reasoning effort for reasoning models. Options: minimal, low, medium, high.
reasoning_options
object
Options for reasoning models.
metadata
object
Custom metadata to attach to the request for tracking and filtering in Helicone.
cache_control
object
Cache control settings for prompt caching.
prompt_cache_key
string
Key for prompt caching to reuse previous prompts.
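Many of the parameters above can be combined in a single request body. A sketch of a fuller payload (the metadata keys shown are hypothetical; support for individual parameters varies by provider and model):

```python
payload = {
    "model": "gpt-4",
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the benefits of response caching."},
    ],
    "temperature": 0.7,   # 0-2; higher values are more random
    "max_tokens": 256,    # cap completion length to control cost
    "stop": ["\n\n"],     # up to 4 stop sequences
    "n": 1,               # number of choices to generate
    "seed": 42,           # seed for deterministic sampling
    "metadata": {"feature": "docs-example"},  # hypothetical keys, for filtering in Helicone
}
```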

Response Format

Non-Streaming Response

id
string
Unique identifier for the completion.
object
string
Object type, always chat.completion.
created
integer
Unix timestamp of when the completion was created.
model
string
The model used for completion.
choices
array
Array of completion choices.
usage
object
Token usage information.

Streaming Response

When stream: true, the response is returned as Server-Sent Events (SSE). Each event contains a JSON object with:
{
  "id": "chatcmpl-123",
  "object": "chat.completion.chunk",
  "created": 1677652288,
  "model": "gpt-4",
  "choices": [
    {
      "index": 0,
      "delta": {
        "content": "Hello"
      },
      "finish_reason": null
    }
  ]
}
The stream ends with a final data: [DONE] event.
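Assembling the full reply means concatenating the delta.content of each chunk until [DONE]. A minimal parser sketch, run here against a hand-written two-chunk sample stream rather than a live connection:

```python
import json

# Hand-written sample SSE stream (two content chunks, then the terminator):
sample_stream = """\
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" there"},"finish_reason":"stop"}]}
data: [DONE]
"""

def collect_content(lines) -> str:
    """Concatenate delta.content across SSE chunks until the [DONE] terminator."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            parts.append(delta["content"])
    return "".join(parts)
```

In production the lines would come from iterating over the HTTP response body instead of a string.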

Error Responses

error
object
Error information when a request fails.
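A failed request can be detected by checking for a top-level error key in the body. A minimal sketch, assuming the common OpenAI-compatible error shape with message and type fields (the example body below is illustrative, not captured from the gateway):

```python
def raise_for_gateway_error(body: dict) -> dict:
    """Raise if the response body carries an error object; otherwise return it unchanged."""
    if "error" in body:
        err = body["error"]
        raise RuntimeError(f"{err.get('type', 'error')}: {err.get('message', '')}")
    return body

# Illustrative error body (shape assumed, not captured from the gateway):
error_body = {"error": {"message": "Invalid API key", "type": "authentication_error"}}
```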

Example Responses

{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "gpt-4",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! I'm doing well, thank you for asking. How can I assist you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 13,
    "completion_tokens": 17,
    "total_tokens": 30
  }
}
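The fields most clients need are the assistant's reply, the finish reason, and token usage. A sketch of pulling them out of the response above (the content string is abbreviated):

```python
response = {
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "created": 1677652288,
    "model": "gpt-4",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello! I'm doing well..."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 13, "completion_tokens": 17, "total_tokens": 30},
}

reply = response["choices"][0]["message"]["content"]
finish = response["choices"][0]["finish_reason"]
total_tokens = response["usage"]["total_tokens"]
```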

Advanced Features

Function Calling

Define tools that the model can use:
{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "What's the weather in Boston?"}],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string"}
          },
          "required": ["location"]
        }
      }
    }
  ]
}
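When the model responds with a tool call, the client executes the function locally and sends the result back as a tool message carrying the matching tool_call_id. A round-trip sketch (get_weather and its return value are hypothetical stand-ins for a real implementation):

```python
import json

def get_weather(location: str) -> dict:
    # Hypothetical local implementation of the declared tool.
    return {"location": location, "forecast": "sunny"}

# A tool call as it might appear in the assistant's response:
tool_call = {
    "id": "call_abc123",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"location": "Boston"}'},
}

# Arguments arrive as a JSON string and must be parsed before dispatch.
args = json.loads(tool_call["function"]["arguments"])
result = get_weather(**args)

# The result goes back to the model as a tool-role message.
tool_message = {
    "role": "tool",
    "tool_call_id": tool_call["id"],
    "content": json.dumps(result),
}
```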

Vision (Image Input)

Include images in your messages:
{
  "model": "gpt-4-vision-preview",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What's in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/image.jpg",
            "detail": "high"
          }
        }
      ]
    }
  ]
}
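Multimodal messages are just content arrays mixing text and image parts, so they are easy to build programmatically. A small helper sketch (the image_message name is illustrative):

```python
def image_message(text: str, image_url: str, detail: str = "auto") -> dict:
    """Build a user message combining a text part and an image part.

    detail controls image processing resolution (commonly low, high, or auto).
    """
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url, "detail": detail}},
        ],
    }

msg = image_message("What's in this image?", "https://example.com/image.jpg", detail="high")
```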

JSON Mode

Force the model to output valid JSON:
{
  "model": "gpt-4",
  "messages": [{"role": "user", "content": "Generate a user profile"}],
  "response_format": {"type": "json_object"}
}
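With json_object, the assistant's content field is a JSON string, so the client still has to parse it. A sketch using an illustrative content string in place of a live response:

```python
import json

payload = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Generate a user profile"}],
    "response_format": {"type": "json_object"},
}

# Illustrative content string, as the model might return it:
content = '{"name": "Ada", "role": "engineer"}'
profile = json.loads(content)
```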

Rate Limits

Rate limits are applied at the organization level and vary based on your Helicone plan. Monitor your usage through the Helicone dashboard.

Best Practices

  • Always include error handling for API calls
  • Use streaming for better user experience with long responses
  • Set appropriate max_tokens to control costs
  • Use metadata to track and filter requests in Helicone
  • Implement retry logic with exponential backoff for transient errors
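The last practice can be sketched as a small wrapper that retries on transient failures with exponentially growing, jittered delays. The flaky function below simulates two transient failures before success; in real use the wrapped call would be the HTTP request, and the caught exception would be the transport or rate-limit error your client raises:

```python
import random
import time

def with_retries(fn, max_attempts: int = 3, base_delay: float = 0.01):
    """Call fn, retrying on ConnectionError with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            time.sleep(base_delay * (2 ** attempt) + random.random() * base_delay)

# Simulated transient failure: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"
```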