# Chat Completions

`POST /v1/chat/completions`
Generate AI responses from a list of messages. This is the primary endpoint for conversational AI applications.
## Basic Usage
Send a list of messages and receive a model-generated response:
```ts
const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Explain quantum computing in simple terms.' }
  ]
});

console.log(response.choices[0].message.content);
```

## Request Body
Full request schema with all available parameters:
```jsonc
{
  "model": "gpt-4o",                            // Required: Model ID
  "messages": [                                 // Required: Array of messages
    {
      "role": "system",                         // system, user, or assistant
      "content": "You are helpful."
    },
    {
      "role": "user",
      "content": "Hello!"                       // String or array for vision
    }
  ],
  "temperature": 0.7,                           // Optional: 0-2, default 1
  "max_tokens": 1000,                           // Optional: Max output tokens
  "top_p": 1,                                   // Optional: Nucleus sampling
  "frequency_penalty": 0,                       // Optional: -2 to 2
  "presence_penalty": 0,                        // Optional: -2 to 2
  "stop": ["\n"],                               // Optional: Stop sequences
  "stream": false,                              // Optional: Enable streaming
  "tools": [],                                  // Optional: Function definitions
  "tool_choice": "auto",                        // Optional: Tool selection mode
  "response_format": { "type": "json_object" }  // Optional: JSON mode
}
```
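One parameter from the schema worth calling out is `response_format`, which enables JSON mode. Here is a minimal sketch, assuming LLMHub is OpenAI-compatible so the OpenAI Node SDK can be pointed at the base URL from the cURL example below (the `baseURL` wiring is an assumption; the prompt wording is illustrative):

```ts
import OpenAI from 'openai';

// Assumption: an OpenAI-compatible client pointed at the LLMHub base URL
// shown in the cURL example below.
const client = new OpenAI({
  baseURL: 'https://api.llmhub.one/v1',
  apiKey: process.env.LLMHUB_API_KEY,
});

const response = await client.chat.completions.create({
  model: 'gpt-4o',
  // JSON mode: constrains the model to emit a single valid JSON object.
  response_format: { type: 'json_object' },
  messages: [
    // In JSON mode the prompt should explicitly mention JSON.
    { role: 'system', content: 'Reply with a JSON object.' },
    { role: 'user', content: 'List three primary colors as {"colors": [...]}.' }
  ]
});

// The message content is a JSON string, so it can be parsed directly.
const data = JSON.parse(response.choices[0].message.content ?? '{}');
console.log(data);
```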
## Parameters

| Parameter | Type | Description |
|---|---|---|
| `model` (required) | string | ID of the model to use (e.g., `"gpt-4o"`, `"claude-3.5-sonnet"`) |
| `messages` (required) | array | Array of message objects with `role` and `content` |
| `temperature` | number | Sampling temperature (0-2). Higher = more random. Default: 1 |
| `max_tokens` | integer | Maximum tokens to generate. Model-dependent default. |
| `top_p` | number | Nucleus sampling. Alternative to temperature. Default: 1 |
| `stream` | boolean | Enable streaming responses via SSE. Default: false |
| `stop` | string \| array | Up to 4 sequences where the API will stop generating |
| `frequency_penalty` | number | Penalize repeated tokens (-2 to 2). Default: 0 |
| `presence_penalty` | number | Penalize tokens that have already appeared in the text (-2 to 2). Default: 0 |
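The request schema above also accepts `tools` and `tool_choice` for function calling. A hedged sketch, assuming OpenAI-compatible tool-definition shapes (the `get_weather` function is hypothetical):

```ts
// `client` is constructed as in the JSON-mode sketch above.
const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'What is the weather in Paris?' }],
  // Hypothetical tool definition, using the OpenAI-compatible shape.
  tools: [
    {
      type: 'function',
      function: {
        name: 'get_weather',
        description: 'Get the current weather for a city',
        parameters: {
          type: 'object',
          properties: { city: { type: 'string' } },
          required: ['city']
        }
      }
    }
  ],
  tool_choice: 'auto' // Let the model decide whether to call the tool
});

// If the model chose to call the tool, finish_reason is "tool_calls"
// and the arguments arrive as a JSON string.
const call = response.choices[0].message.tool_calls?.[0];
if (call) {
  console.log(call.function.name, JSON.parse(call.function.arguments));
}
```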
## Response
The API returns a chat completion object:
```jsonc
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1706123456,
  "model": "gpt-4o",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"  // stop, length, tool_calls
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 12,
    "total_tokens": 37
  }
}
```

## Finish Reasons
| Value | Meaning |
|---|---|
| `stop` | Model finished naturally or hit a stop sequence |
| `length` | Hit the `max_tokens` limit |
| `tool_calls` | Model wants to call a function/tool |
| `content_filter` | Content was filtered due to policy |
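In practice it's worth checking `finish_reason` before trusting the output, for example to detect truncation. A minimal sketch (the handling logic is illustrative, not part of the API):

```ts
// `client` is constructed as in the JSON-mode sketch above.
const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Summarize the plot of Hamlet.' }],
  max_tokens: 200
});

const choice = response.choices[0];
switch (choice.finish_reason) {
  case 'stop':           // Finished naturally or hit a stop sequence
    console.log(choice.message.content);
    break;
  case 'length':         // Truncated at max_tokens; consider raising the limit
    console.warn('Output was cut off:', choice.message.content);
    break;
  case 'tool_calls':     // The model wants a function/tool to be invoked
    console.log('Tool requested:', choice.message.tool_calls);
    break;
  case 'content_filter': // Output removed by the content policy
    console.warn('Response was filtered.');
    break;
}

// Token accounting from the usage block, e.g. for cost tracking.
console.log(`Used ${response.usage?.total_tokens} tokens`);
```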
## Streaming
Enable real-time streaming to receive tokens as they're generated. Set `stream: true` in your request:
```ts
const stream = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Write a haiku about programming' }],
  stream: true
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || '';
  process.stdout.write(content);
}
```
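Each chunk carries only a delta, so if you need the complete message afterwards, accumulate the deltas yourself. A small sketch of that pattern:

```ts
// `client` is constructed as in the JSON-mode sketch above.
const stream = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Write a haiku about programming' }],
  stream: true
});

// Each chunk carries only a delta; concatenate them for the full message.
let fullText = '';
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || '';
  process.stdout.write(content); // Render incrementally...
  fullText += content;           // ...while keeping the complete text.
}
console.log('\n---\nFull message:', fullText);
```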
## Multi-turn Conversations

Include previous messages in the conversation to maintain context:
```ts
const messages = [
  { role: 'system', content: 'You are a helpful coding assistant.' },
  { role: 'user', content: 'How do I read a file in Python?' },
  { role: 'assistant', content: 'You can use the built-in open() function...' },
  { role: 'user', content: 'What about reading it line by line?' }
];

const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages
});
```

> **Tip:** The system message sets the AI's personality and behavior. Include it at the start of every conversation.
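To keep a conversation going, append the assistant's reply and the next user turn before calling the endpoint again. A minimal sketch (the `ask` helper is illustrative, not part of any SDK):

```ts
// `client` is constructed as in the JSON-mode sketch above.
const history: { role: 'system' | 'user' | 'assistant'; content: string }[] = [
  { role: 'system', content: 'You are a helpful coding assistant.' }
];

// Hypothetical helper: sends the history, records the reply, returns it.
async function ask(question: string): Promise<string> {
  history.push({ role: 'user', content: question });
  const response = await client.chat.completions.create({
    model: 'gpt-4o',
    messages: history
  });
  const reply = response.choices[0].message.content ?? '';
  // Append the assistant turn so the next call sees the full context.
  history.push({ role: 'assistant', content: reply });
  return reply;
}

console.log(await ask('How do I read a file in Python?'));
console.log(await ask('What about reading it line by line?'));
```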
## Controlling Output
Fine-tune the response using these parameters:
```ts
const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Tell me a story' }],
  // Control randomness (0 = deterministic, 2 = very random)
  temperature: 0.7,
  // Limit response length
  max_tokens: 500,
  // Alternative to temperature
  top_p: 0.9,
  // Penalize repetition
  frequency_penalty: 0.5,
  presence_penalty: 0.5,
  // Stop generation at specific strings
  stop: ['THE END', '\n\n']
});
```

### Temperature vs Top P
Both control randomness; use one or the other, not both (the example above sets both only to show the syntax). Temperature is more intuitive (0 = focused, 2 = creative). Top P uses nucleus sampling (0.1 = only the top 10% of probability mass is considered).
### Frequency vs Presence Penalty
Frequency penalty scales with how often a token has already appeared, reducing verbatim repetition. Presence penalty applies once a token has appeared at all, nudging the model toward new topics. Values of 0.5-1.0 for both are a reasonable starting point for reducing repetition.
## cURL Example
Test the API directly from your terminal:
```bash
curl https://api.llmhub.one/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LLMHUB_API_KEY" \
  -d '{
    "model": "gpt-4o",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }'
```
