Streaming Responses

Receive AI responses in real time as they're generated, creating a more responsive user experience.

Overview

Streaming allows you to receive partial responses as the model generates them, rather than waiting for the complete response. This is ideal for:

  • Chat interfaces where users expect immediate feedback
  • Long-form content generation (stories, articles, code)
  • Reducing perceived latency in your application
  • Handling large responses without timeout issues

Basic Usage

Set stream: true to enable streaming:

TypeScript
const stream = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Write a short story about a robot.' }],
  stream: true
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || '';
  process.stdout.write(content);
}

Stream Chunk Format

Each chunk contains partial content in the delta field:

JSON
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion.chunk",
  "created": 1706123456,
  "model": "gpt-4o",
  "choices": [
    {
      "index": 0,
      "delta": {
        "content": "Hello"
      },
      "finish_reason": null
    }
  ]
}

Key Differences from Non-Streaming

  • Object type is chat.completion.chunk instead of chat.completion
  • Content is in delta.content instead of message.content
  • finish_reason is null until the last chunk
  • No usage field (tokens counted after streaming completes)
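
Because content arrives as delta fragments, reconstructing the full message is a matter of concatenation. A minimal sketch using simplified mock chunks (the Chunk interface below is illustrative; real chunks carry more fields):

TypeScript
```typescript
// Simplified chunk shape for illustration; real chunks carry more fields.
interface Chunk {
  choices: { delta: { content?: string }; finish_reason: string | null }[];
}

// Concatenate streamed deltas into the complete message text.
function accumulate(chunks: Chunk[]): string {
  let full = '';
  for (const chunk of chunks) {
    full += chunk.choices[0]?.delta?.content ?? '';
  }
  return full;
}

// Mock chunks in arrival order; the last carries only finish_reason.
const mock: Chunk[] = [
  { choices: [{ delta: { content: 'Hel' }, finish_reason: null }] },
  { choices: [{ delta: { content: 'lo' }, finish_reason: null }] },
  { choices: [{ delta: {}, finish_reason: 'stop' }] },
];

console.log(accumulate(mock)); // "Hello"
```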

Python Example

Python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.llmhub.one/v1",
    api_key="your-api-key"
)

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a poem"}],
    stream=True
)

for chunk in stream:
    content = chunk.choices[0].delta.content or ""
    print(content, end="", flush=True)

React Integration

Here's how to implement streaming in a Next.js application:

Server Route

TypeScript
// app/api/chat/route.ts
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.llmhub.one/v1',
  apiKey: process.env.LLMHUB_API_KEY!
});

export async function POST(request: Request) {
  const { prompt } = await request.json();

  const stream = await client.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: prompt }],
    stream: true
  });

  // Create a readable stream for the response
  const encoder = new TextEncoder();
  const readable = new ReadableStream({
    async start(controller) {
      for await (const chunk of stream) {
        const content = chunk.choices[0]?.delta?.content || '';
        controller.enqueue(encoder.encode(content));
      }
      controller.close();
    }
  });

  return new Response(readable, {
    headers: { 'Content-Type': 'text/plain; charset=utf-8' }
  });
}

Client Component

TypeScript
'use client';

import { useState } from 'react';

export function Chat() {
  const [response, setResponse] = useState('');
  const [isLoading, setIsLoading] = useState(false);

  async function handleSubmit(prompt: string) {
    setIsLoading(true);
    setResponse('');

    const res = await fetch('/api/chat', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ prompt })
    });

    const reader = res.body?.getReader();
    if (!reader) {
      setIsLoading(false);
      return;
    }
    const decoder = new TextDecoder();

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      // stream: true handles multi-byte characters split across chunks
      const text = decoder.decode(value, { stream: true });
      setResponse(prev => prev + text);
    }

    setIsLoading(false);
  }

  return (
    <div>
      <pre>{response}</pre>
      {isLoading && <span>Generating...</span>}
    </div>
  );
}

Server-Sent Events (SSE)

The API uses the SSE protocol for streaming. Each chunk is prefixed with data: and the stream ends with data: [DONE]:

TypeScript
// Server-Sent Events format
const stream = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello' }],
  stream: true
});

// Each chunk is sent as SSE:
// data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":"Hi"}}]}

// Final message:
// data: [DONE]
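
If you consume the raw HTTP stream yourself instead of using an SDK, you need to parse these data: lines. A sketch of a line parser (the helper name is illustrative, not part of any SDK):

TypeScript
```typescript
// Parse one SSE line: returns the JSON payload, or null for the [DONE]
// terminator and for non-data lines (comments, blank keep-alives).
function parseSSELine(line: string): unknown | null {
  if (!line.startsWith('data: ')) return null;
  const payload = line.slice('data: '.length);
  if (payload === '[DONE]') return null; // end-of-stream marker
  return JSON.parse(payload);
}

const chunk = parseSSELine(
  'data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":"Hi"}}]}'
) as { choices: { delta: { content: string } }[] };

console.log(chunk.choices[0].delta.content); // "Hi"
```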

Streaming with Function Calls

When using tools/functions with streaming, function call data is streamed progressively:

TypeScript
const stream = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'What is the weather in Paris?' }],
  tools: [{
    type: 'function',
    function: {
      name: 'get_weather',
      description: 'Get current weather',
      parameters: {
        type: 'object',
        properties: {
          location: { type: 'string' }
        }
      }
    }
  }],
  stream: true
});

let functionCall = { name: '', arguments: '' };

for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta;
  
  if (delta?.tool_calls?.[0]) {
    const toolDelta = delta.tool_calls[0];
    if (toolDelta.function?.name) {
      functionCall.name = toolDelta.function.name;
    }
    if (toolDelta.function?.arguments) {
      functionCall.arguments += toolDelta.function.arguments;
    }
  }
  
  if (delta?.content) {
    process.stdout.write(delta.content);
  }
}

console.log('Function call:', functionCall);

Note: Function arguments are streamed as partial JSON strings. You need to concatenate them before parsing.
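
For example, the complete argument object only becomes valid JSON once every fragment has been joined (the fragments below are illustrative):

TypeScript
```typescript
// Tool-call argument fragments arrive as partial JSON strings; join them
// first, then parse once the stream has finished.
const fragments = ['{"loca', 'tion": ', '"Paris"}'];
const args = JSON.parse(fragments.join('')) as { location: string };

console.log(args.location); // "Paris"
```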

Error Handling

Connection Errors

If the connection drops mid-stream, catch the error and optionally retry with the partial content you've received so far.
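
One possible shape for this (a sketch; the attempt callback and the continuation strategy are application-specific assumptions, not part of the API):

TypeScript
```typescript
// Retry a dropped stream while keeping partial content. `attempt` receives
// the text received so far (so a retry can ask the model to continue) and
// an onToken callback; it may throw mid-stream if the connection drops.
async function streamWithRetry(
  attempt: (soFar: string, onToken: (t: string) => void) => Promise<void>,
  maxRetries = 2
): Promise<string> {
  let partial = '';
  for (let tries = 0; ; tries++) {
    try {
      await attempt(partial, t => { partial += t; });
      return partial;
    } catch (err) {
      if (tries >= maxRetries) throw err;
      // Otherwise loop: `partial` preserves everything received so far.
    }
  }
}
```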

Timeout Handling

Streaming requests can run longer than regular requests. Set appropriate timeouts (60-120 seconds) for long-form content.

Incomplete Responses

Check the finish_reason in the final chunk. If it's length, the response was truncated.

Best Practices

Buffer Output for UI

Consider buffering chunks (e.g., by word or sentence) instead of updating the UI on every token for smoother rendering.
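
A word-level buffer can be as simple as flushing up to the last space and keeping the remainder (a sketch; the helper name is illustrative):

TypeScript
```typescript
// Buffer streamed tokens and flush only whole words to the UI callback.
function createWordBuffer(onFlush: (text: string) => void) {
  let buffer = '';
  return {
    push(token: string) {
      buffer += token;
      const lastSpace = buffer.lastIndexOf(' ');
      if (lastSpace >= 0) {
        onFlush(buffer.slice(0, lastSpace + 1)); // emit complete words
        buffer = buffer.slice(lastSpace + 1);    // keep the partial word
      }
    },
    flush() {
      if (buffer) onFlush(buffer); // emit any trailing partial word
      buffer = '';
    },
  };
}

// Feed delta.content tokens as they arrive; flush when the stream ends.
const out: string[] = [];
const wb = createWordBuffer(t => out.push(t));
['Hel', 'lo wor', 'ld'].forEach(t => wb.push(t));
wb.flush();

console.log(out.join('')); // "Hello world"
```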

Show Loading State

Display a typing indicator or cursor while streaming to show the AI is still generating.

Handle Aborts

Allow users to cancel generation. Use an AbortController to stop the stream and free up resources.
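
On the client this can look like the following (a sketch assuming the /api/chat route shown earlier; the generate helper is illustrative):

TypeScript
```typescript
// Cancel an in-flight streaming request with an AbortController.
const controller = new AbortController();

async function generate(prompt: string) {
  try {
    const res = await fetch('/api/chat', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ prompt }),
      signal: controller.signal, // fetch rejects with an AbortError on cancel
    });
    // ...read res.body as shown in the client component above...
  } catch (err) {
    if ((err as Error).name === 'AbortError') {
      // User cancelled; any partial text already rendered can be kept.
      return;
    }
    throw err;
  }
}

// Wire controller.abort() to a "Stop" button to cancel generation.
```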
