Streaming Responses
Receive AI responses in real time as they are generated, for a more responsive user experience.
Overview
Streaming allows you to receive partial responses as the model generates them, rather than waiting for the complete response. This is ideal for:
- Chat interfaces where users expect immediate feedback
- Long-form content generation (stories, articles, code)
- Reducing perceived latency in your application
- Handling large responses without timeout issues
Basic Usage
Set stream: true to enable streaming:
const stream = await client.chat.completions.create({
model: 'gpt-4o',
messages: [{ role: 'user', content: 'Write a short story about a robot.' }],
stream: true
});
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content || '';
process.stdout.write(content);
}
Stream Chunk Format
Each chunk contains partial content in the delta field:
{
"id": "chatcmpl-abc123",
"object": "chat.completion.chunk",
"created": 1706123456,
"model": "gpt-4o",
"choices": [
{
"index": 0,
"delta": {
"content": "Hello"
},
"finish_reason": null
}
]
}
Key Differences from Non-Streaming
- Object type is chat.completion.chunk instead of chat.completion
- Content is in delta.content instead of message.content (see the sketch below)
- finish_reason is null until the last chunk
- No usage field (tokens are counted after streaming completes)
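For example, a minimal sketch that accumulates the streamed deltas into the full message text and reads the finish_reason from the final chunk (assuming a stream created with stream: true as in the Basic Usage example above):
let fullText = '';
let finishReason: string | null = null;

for await (const chunk of stream) {
  const choice = chunk.choices[0];
  // Each delta carries only the newly generated tokens
  fullText += choice?.delta?.content || '';
  // finish_reason stays null until the final chunk
  if (choice?.finish_reason) finishReason = choice.finish_reason;
}

console.log(fullText);
console.log('finish_reason:', finishReason);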
Python Example
from openai import OpenAI
client = OpenAI(
base_url="https://api.llmhub.one/v1",
api_key="your-api-key"
)
stream = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Write a poem"}],
stream=True
)
for chunk in stream:
content = chunk.choices[0].delta.content or ""
print(content, end="", flush=True)
React Integration
Here's how to implement streaming in a Next.js application:
Server Route
// app/api/chat/route.ts
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: 'https://api.llmhub.one/v1',
apiKey: process.env.LLMHUB_API_KEY!
});
export async function POST(request: Request) {
const { prompt } = await request.json();
const stream = await client.chat.completions.create({
model: 'gpt-4o',
messages: [{ role: 'user', content: prompt }],
stream: true
});
// Create a readable stream for the response
const encoder = new TextEncoder();
const readable = new ReadableStream({
async start(controller) {
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content || '';
controller.enqueue(encoder.encode(content));
}
controller.close();
}
});
return new Response(readable, {
headers: { 'Content-Type': 'text/plain; charset=utf-8' }
});
}
Client Component
'use client';
import { useState } from 'react';
export function Chat() {
const [response, setResponse] = useState('');
const [isLoading, setIsLoading] = useState(false);
async function handleSubmit(prompt: string) {
setIsLoading(true);
setResponse('');
const res = await fetch('/api/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ prompt })
});
const reader = res.body?.getReader();
const decoder = new TextDecoder();
while (reader) {
const { done, value } = await reader.read();
if (done) break;
// stream: true avoids splitting multi-byte characters across chunk boundaries
const text = decoder.decode(value, { stream: true });
setResponse(prev => prev + text);
}
setIsLoading(false);
}
return (
<div>
<pre>{response}</pre>
{isLoading && <span>Generating...</span>}
</div>
);
}
Server-Sent Events (SSE)
The API uses the SSE protocol for streaming. Each chunk is prefixed with data: and the stream ends with [DONE]:
// Server-Sent Events format
const stream = await client.chat.completions.create({
model: 'gpt-4o',
messages: [{ role: 'user', content: 'Hello' }],
stream: true
});
// Each chunk is sent as SSE:
// data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":"Hi"}}]}
// Final message:
// data: [DONE]
Streaming with Function Calls
When using tools/functions with streaming, function call data is streamed progressively:
const stream = await client.chat.completions.create({
model: 'gpt-4o',
messages: [{ role: 'user', content: 'What is the weather in Paris?' }],
tools: [{
type: 'function',
function: {
name: 'get_weather',
description: 'Get current weather',
parameters: {
type: 'object',
properties: {
location: { type: 'string' }
}
}
}
}],
stream: true
});
let functionCall = { name: '', arguments: '' };
for await (const chunk of stream) {
const delta = chunk.choices[0]?.delta;
if (delta?.tool_calls?.[0]) {
const toolDelta = delta.tool_calls[0];
if (toolDelta.function?.name) {
functionCall.name = toolDelta.function.name;
}
if (toolDelta.function?.arguments) {
functionCall.arguments += toolDelta.function.arguments;
}
}
if (delta?.content) {
process.stdout.write(delta.content);
}
}
console.log('Function call:', functionCall);
Note: Function arguments are streamed as partial JSON strings. You need to concatenate them before parsing.
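Once the stream ends, the accumulated string can be parsed in one step. A small sketch continuing the example above (it assumes the model actually emitted a tool call):
// Parse only after the stream ends, when the JSON string is complete
if (functionCall.name) {
  try {
    const args = JSON.parse(functionCall.arguments);
    console.log(`Call ${functionCall.name} with`, args);
  } catch {
    console.error('Function arguments were incomplete or malformed');
  }
}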
Error Handling
Connection Errors
If the connection drops mid-stream, catch the error and optionally retry with the partial content you've received so far.
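A minimal sketch of that pattern, assuming a stream created as in the earlier examples (how you recover with the partial text is up to your application):
let partial = '';
try {
  for await (const chunk of stream) {
    partial += chunk.choices[0]?.delta?.content || '';
  }
} catch (error) {
  console.error('Stream interrupted:', error);
  // Recover app-specifically: show `partial` to the user, or resend the
  // request asking the model to continue from where it stopped.
}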
Timeout Handling
Streaming requests can run longer than regular requests. Set appropriate timeouts (60-120 seconds) for long-form content.
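With the OpenAI SDK you can set the timeout (in milliseconds) on the client or per request; a sketch is below, though the exact option may vary with your SDK version:
import OpenAI from 'openai';

// Client-wide timeout of 120 seconds
const client = new OpenAI({
  baseURL: 'https://api.llmhub.one/v1',
  apiKey: process.env.LLMHUB_API_KEY!,
  timeout: 120 * 1000
});

// Or override the timeout for a single long-form streaming request
const stream = await client.chat.completions.create(
  {
    model: 'gpt-4o',
    messages: [{ role: 'user', content: 'Write a long article' }],
    stream: true
  },
  { timeout: 120 * 1000 }
);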
Incomplete Responses
Check the finish_reason in the final chunk. If it's length, the response was truncated.
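A sketch of checking for truncation while printing the stream:
let finishReason: string | null = null;

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || '');
  // The last chunk carries the finish_reason
  finishReason = chunk.choices[0]?.finish_reason ?? finishReason;
}

if (finishReason === 'length') {
  // Truncated: raise max_tokens or send a follow-up asking the model to continue
  console.warn('Response was cut off at the token limit');
}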
Best Practices
Buffer Output for UI
Consider buffering chunks (e.g., by word or sentence) instead of updating the UI on every token for smoother rendering.
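One way to do this is to flush the buffer only at word or sentence boundaries. A sketch of the idea; onChunk and render are placeholder names for your own chunk handler and UI update function:
let buffer = '';

function onChunk(text: string, render: (piece: string) => void) {
  buffer += text;
  // Flush on whitespace or sentence-ending punctuation for smoother updates
  if (/[\s.!?]$/.test(buffer)) {
    render(buffer);
    buffer = '';
  }
  // Remember to flush whatever remains in the buffer when the stream ends
}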
Show Loading State
Display a typing indicator or cursor while streaming to show the AI is still generating.
Handle Aborts
Allow users to cancel generation. Use an AbortController to stop the stream and free up resources.
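For example, in the client component above you can pass an AbortSignal to fetch and abort it from a Stop button (a sketch; streamWithCancel and cancelGeneration are illustrative names):
const controller = new AbortController();

async function streamWithCancel(prompt: string) {
  const res = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
    // Calling controller.abort() rejects the fetch and stops the reader
    signal: controller.signal
  });
  // ...read res.body with a reader as in the client component above
}

// Wire this to a "Stop" button in the UI
function cancelGeneration() {
  controller.abort();
}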

