# Chat Completions

`POST /v1/chat/completions`
Generate AI responses from a list of messages. This is the primary endpoint for conversational AI applications.
## Basic Usage
Send a list of messages and receive a model-generated response:
```ts
const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Explain quantum computing in simple terms.' }
  ]
});

console.log(response.choices[0].message.content);
```

## Request Body
Full request schema with all available parameters:
```jsonc
{
  "model": "gpt-4o",                            // Required: Model ID
  "messages": [                                 // Required: Array of messages
    {
      "role": "system",                         // system, user, or assistant
      "content": "You are helpful."
    },
    {
      "role": "user",
      "content": "Hello!"                       // String or array for vision
    }
  ],
  "temperature": 0.7,                           // Optional: 0-2, default 1
  "max_tokens": 1000,                           // Optional: Max output tokens
  "top_p": 1,                                   // Optional: Nucleus sampling
  "frequency_penalty": 0,                       // Optional: -2 to 2
  "presence_penalty": 0,                        // Optional: -2 to 2
  "stop": ["\n"],                               // Optional: Stop sequences
  "stream": false,                              // Optional: Enable streaming
  "tools": [],                                  // Optional: Function definitions
  "tool_choice": "auto",                        // Optional: Tool selection mode
  "response_format": { "type": "json_object" }  // Optional: JSON mode
}
```
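One parameter from the schema worth calling out is `response_format`, which enables JSON mode. Here is a minimal sketch, assuming LLMHub is OpenAI-compatible so the OpenAI Node SDK can be pointed at the base URL from the cURL example below (the `baseURL` wiring is an assumption; the prompt wording is illustrative):

```ts
import OpenAI from 'openai';

// Assumption: an OpenAI-compatible client pointed at the LLMHub base URL
// shown in the cURL example below.
const client = new OpenAI({
  baseURL: 'https://api.llmhub.one/v1',
  apiKey: process.env.LLMHUB_API_KEY,
});

const response = await client.chat.completions.create({
  model: 'gpt-4o',
  // JSON mode: constrains the model to emit a single valid JSON object.
  response_format: { type: 'json_object' },
  messages: [
    // In JSON mode the prompt should explicitly mention JSON.
    { role: 'system', content: 'Reply with a JSON object.' },
    { role: 'user', content: 'List three primary colors as {"colors": [...]}.' }
  ]
});

// The message content is a JSON string, so it can be parsed directly.
const data = JSON.parse(response.choices[0].message.content ?? '{}');
console.log(data);
```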
## Parameters

| Parameter | Type | Description |
|---|---|---|
| `model` (required) | string | ID of the model to use (e.g., `"gpt-4o"`, `"claude-3.5-sonnet"`) |
| `messages` (required) | array | Array of message objects with `role` and `content` |
| `temperature` | number | Sampling temperature (0-2). Higher = more random. Default: 1 |
| `max_tokens` | integer | Maximum tokens to generate. Model-dependent default. |
| `top_p` | number | Nucleus sampling. Alternative to temperature. Default: 1 |
| `stream` | boolean | Enable streaming responses via SSE. Default: false |
| `stop` | string \| array | Up to 4 sequences where the API will stop generating |
| `frequency_penalty` | number | Penalize repeated tokens (-2 to 2). Default: 0 |
| `presence_penalty` | number | Penalize tokens that have already appeared in the text (-2 to 2). Default: 0 |
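The request schema above also accepts `tools` and `tool_choice` for function calling. A hedged sketch, assuming OpenAI-compatible tool-definition shapes (the `get_weather` function is hypothetical):

```ts
// `client` is constructed as in the JSON-mode sketch above.
const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'What is the weather in Paris?' }],
  // Hypothetical tool definition, using the OpenAI-compatible shape.
  tools: [
    {
      type: 'function',
      function: {
        name: 'get_weather',
        description: 'Get the current weather for a city',
        parameters: {
          type: 'object',
          properties: { city: { type: 'string' } },
          required: ['city']
        }
      }
    }
  ],
  tool_choice: 'auto' // Let the model decide whether to call the tool
});

// If the model chose to call the tool, finish_reason is "tool_calls"
// and the arguments arrive as a JSON string.
const call = response.choices[0].message.tool_calls?.[0];
if (call) {
  console.log(call.function.name, JSON.parse(call.function.arguments));
}
```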
## Response
The API returns a chat completion object:
```jsonc
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1706123456,
  "model": "gpt-4o",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"  // stop, length, tool_calls
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 12,
    "total_tokens": 37
  }
}
```

## Finish Reasons
| Value | Meaning |
|---|---|
| `stop` | Model finished naturally or hit a stop sequence |
| `length` | Hit the `max_tokens` limit |
| `tool_calls` | Model wants to call a function/tool |
| `content_filter` | Content was filtered due to policy |
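In practice it's worth checking `finish_reason` before trusting the output, for example to detect truncation. A minimal sketch (the handling logic is illustrative, not part of the API):

```ts
// `client` is constructed as in the JSON-mode sketch above.
const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Summarize the plot of Hamlet.' }],
  max_tokens: 200
});

const choice = response.choices[0];
switch (choice.finish_reason) {
  case 'stop':           // Finished naturally or hit a stop sequence
    console.log(choice.message.content);
    break;
  case 'length':         // Truncated at max_tokens; consider raising the limit
    console.warn('Output was cut off:', choice.message.content);
    break;
  case 'tool_calls':     // The model wants a function/tool to be invoked
    console.log('Tool requested:', choice.message.tool_calls);
    break;
  case 'content_filter': // Output removed by the content policy
    console.warn('Response was filtered.');
    break;
}

// Token accounting from the usage block, e.g. for cost tracking.
console.log(`Used ${response.usage?.total_tokens} tokens`);
```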
## Streaming
Enable real-time streaming to receive tokens as they're generated. Set `stream: true` in your request:
```ts
const stream = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Write a haiku about programming' }],
  stream: true
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || '';
  process.stdout.write(content);
}
```
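Each chunk carries only a delta, so if you need the complete message afterwards, accumulate the deltas yourself. A small sketch of that pattern:

```ts
// `client` is constructed as in the JSON-mode sketch above.
const stream = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Write a haiku about programming' }],
  stream: true
});

// Each chunk carries only a delta; concatenate them for the full message.
let fullText = '';
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || '';
  process.stdout.write(content); // Render incrementally...
  fullText += content;           // ...while keeping the complete text.
}
console.log('\n---\nFull message:', fullText);
```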
## Multi-turn Conversations

Include previous messages in the conversation to maintain context:
```ts
const messages = [
  { role: 'system', content: 'You are a helpful coding assistant.' },
  { role: 'user', content: 'How do I read a file in Python?' },
  { role: 'assistant', content: 'You can use the built-in open() function...' },
  { role: 'user', content: 'What about reading it line by line?' }
];

const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages
});
```

> **Tip:** The system message sets the AI's personality and behavior. Include it at the start of every conversation.
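To keep a conversation going, append the assistant's reply and the next user turn before calling the endpoint again. A minimal sketch (the `ask` helper is illustrative, not part of any SDK):

```ts
// `client` is constructed as in the JSON-mode sketch above.
const history: { role: 'system' | 'user' | 'assistant'; content: string }[] = [
  { role: 'system', content: 'You are a helpful coding assistant.' }
];

// Hypothetical helper: sends the history, records the reply, returns it.
async function ask(question: string): Promise<string> {
  history.push({ role: 'user', content: question });
  const response = await client.chat.completions.create({
    model: 'gpt-4o',
    messages: history
  });
  const reply = response.choices[0].message.content ?? '';
  // Append the assistant turn so the next call sees the full context.
  history.push({ role: 'assistant', content: reply });
  return reply;
}

console.log(await ask('How do I read a file in Python?'));
console.log(await ask('What about reading it line by line?'));
```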
## Controlling Output
Fine-tune the response using these parameters:
```ts
const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Tell me a story' }],
  // Control randomness (0 = deterministic, 2 = very random)
  temperature: 0.7,
  // Limit response length
  max_tokens: 500,
  // Alternative to temperature
  top_p: 0.9,
  // Penalize repetition
  frequency_penalty: 0.5,
  presence_penalty: 0.5,
  // Stop generation at specific strings
  stop: ['THE END', '\n\n']
});
```

### Temperature vs Top P
Both control randomness; use one or the other, not both (the example above sets both only to show the syntax). Temperature is more intuitive (0 = focused, 2 = creative). Top P uses nucleus sampling (0.1 = only the top 10% of probability mass is considered).
### Frequency vs Presence Penalty
Frequency penalty scales with how often a token has already appeared, reducing verbatim repetition. Presence penalty applies once a token has appeared at all, nudging the model toward new topics. Values of 0.5-1.0 for both are a reasonable starting point for reducing repetition.
## cURL Example
Test the API directly from your terminal:
```bash
curl https://api.llmhub.one/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LLMHUB_API_KEY" \
  -d '{
    "model": "gpt-4o",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }'
```
