Rate Limits

Understand and work within API rate limits to build reliable applications.

Overview

Rate limits protect the API from abuse and ensure fair usage for all users. Limits are applied per API key and measured in:

  • Requests per minute (RPM) — How many API calls you can make
  • Tokens per minute (TPM) — How many tokens you can process
  • Requests per day (RPD) — Daily request quota

Default Limits

| Tier | Qualification | RPM | TPM | RPD |
|------|---------------|-----|-----|-----|
| Free | New accounts | 20 | 40,000 | 500 |
| Pro | €20+ spend | 100 | 200,000 | 5,000 |
| Business | €100+ spend | 500 | 1,000,000 | 25,000 |
| Enterprise | Contact sales | Custom | Custom | Unlimited |

Your tier automatically upgrades based on your cumulative spending. Check your current limits in the dashboard.

Rate Limit Headers

Every response includes headers to help you track your usage:

| Header | Description |
|--------|-------------|
| x-ratelimit-limit-requests | Maximum requests per minute |
| x-ratelimit-limit-tokens | Maximum tokens per minute |
| x-ratelimit-remaining-requests | Requests remaining in the current window |
| x-ratelimit-remaining-tokens | Tokens remaining in the current window |
| x-ratelimit-reset-requests | Seconds until the request limit resets |
| x-ratelimit-reset-tokens | Seconds until the token limit resets |
| retry-after | Seconds to wait before retrying (sent with 429 responses) |
TypeScript
// Check rate limit headers in the response
const response = await fetch('https://api.llmhub.dev/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer your-api-key',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: 'Hello!' }]
  })
});

// Rate limit headers
console.log('Requests remaining:', response.headers.get('x-ratelimit-remaining-requests'));
console.log('Tokens remaining:', response.headers.get('x-ratelimit-remaining-tokens'));
console.log('Requests reset in (s):', response.headers.get('x-ratelimit-reset-requests'));

Rate Limit Errors

When you exceed rate limits, the API returns a 429 Too Many Requests response:

json
{
  "error": {
    "message": "Rate limit exceeded. Please retry after 30 seconds.",
    "type": "rate_limit_error",
    "param": null,
    "code": "rate_limit_exceeded"
  }
}

Retry with Exponential Backoff

Implement automatic retries with exponential backoff for resilient applications:

TypeScript
async function callWithRetry(
  fn: () => Promise<Response>,
  maxRetries = 3,
  baseDelay = 1000
): Promise<Response> {
  let lastError: Error | null = null;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    let response: Response;
    try {
      response = await fn();
    } catch (error) {
      // Network error - retry with backoff
      lastError = error as Error;
      const delay = baseDelay * Math.pow(2, attempt);
      console.log(`Request failed. Retrying in ${delay}ms...`);
      await sleep(delay);
      continue;
    }

    // Success
    if (response.ok) {
      return response;
    }

    // Rate limited - honor retry-after if present, otherwise back off
    if (response.status === 429) {
      const retryAfter = response.headers.get('retry-after');
      const delay = retryAfter
        ? parseInt(retryAfter, 10) * 1000
        : baseDelay * Math.pow(2, attempt);

      console.log(`Rate limited. Retrying in ${delay}ms...`);
      await sleep(delay);
      continue;
    }

    // Other error - don't retry (thrown outside the try so it isn't
    // caught and mistaken for a retryable network failure)
    throw new Error(`API error: ${response.status}`);
  }

  throw lastError || new Error('Max retries exceeded');
}

const sleep = (ms: number) => new Promise(r => setTimeout(r, ms));

Request Throttling

Proactively limit your request rate to avoid hitting limits:

TypeScript
class RateLimiter {
  private queue: Array<() => void> = [];
  private running = 0;
  private maxConcurrent: number;
  private minDelay: number;
  private lastRequest = 0;

  constructor(maxConcurrent = 5, requestsPerSecond = 10) {
    this.maxConcurrent = maxConcurrent;
    this.minDelay = 1000 / requestsPerSecond;
  }

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    // Wait for a slot
    while (this.running >= this.maxConcurrent) {
      await new Promise<void>(resolve => this.queue.push(resolve));
    }

    // Ensure minimum delay between requests
    const now = Date.now();
    const timeSinceLastRequest = now - this.lastRequest;
    if (timeSinceLastRequest < this.minDelay) {
      await new Promise(r => setTimeout(r, this.minDelay - timeSinceLastRequest));
    }

    this.running++;
    this.lastRequest = Date.now();

    try {
      return await fn();
    } finally {
      this.running--;
      const next = this.queue.shift();
      if (next) next();
    }
  }
}

// Usage
const limiter = new RateLimiter(5, 10); // 5 concurrent, 10 req/sec

const results = await Promise.all(
  prompts.map(prompt => 
    limiter.execute(() => 
      client.chat.completions.create({
        model: 'gpt-4o-mini',
        messages: [{ role: 'user', content: prompt }]
      })
    )
  )
);

Batch Processing

Process large workloads efficiently with controlled batching:

TypeScript
// Instead of sending requests one by one...
// ❌ Bad: 100 separate requests
for (const item of items) {
  await processItem(item);
}

// ✅ Good: Batch requests with controlled concurrency
async function processBatch<T, R>(
  items: T[],
  processor: (item: T) => Promise<R>,
  batchSize = 10
): Promise<R[]> {
  const results: R[] = [];
  
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    const batchResults = await Promise.all(batch.map(processor));
    results.push(...batchResults);
    
    // Optional: Add delay between batches
    if (i + batchSize < items.length) {
      await new Promise(r => setTimeout(r, 100));
    }
  }
  
  return results;
}

// Process 100 items in batches of 10
const results = await processBatch(items, processItem, 10);

Best Practices

Monitor Rate Limit Headers

Track x-ratelimit-remaining-* headers and slow down before hitting limits.
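As a sketch of this practice, the helper below reads the headers documented above and pauses when the window is nearly exhausted. The threshold of 5 remaining requests and the fallback wait are illustrative choices, not values prescribed by the API:

```typescript
// Sketch: slow down before hitting the limit, using the x-ratelimit-* headers.
// The threshold (5) and fallback wait (1s) are illustrative.
async function throttleIfNearLimit(response: Response): Promise<void> {
  const remaining = Number(response.headers.get('x-ratelimit-remaining-requests'));
  const resetSeconds = Number(response.headers.get('x-ratelimit-reset-requests'));

  // If fewer than 5 requests remain in this window, wait for the reset.
  if (!Number.isNaN(remaining) && remaining < 5) {
    const waitMs = (Number.isNaN(resetSeconds) ? 1 : resetSeconds) * 1000;
    console.log(`Only ${remaining} requests left; waiting ${waitMs}ms`);
    await new Promise(r => setTimeout(r, waitMs));
  }
}
```

Call this after each response before dispatching the next request.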

Use Exponential Backoff

Start with a 1-second delay and double it on each retry (1s, 2s, 4s, 8s). Add jitter to prevent thundering herd.
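One way to sketch the jitter part: "full jitter" picks a random delay between zero and the exponential cap, so simultaneous clients don't all retry at the same instant. The 30-second ceiling is an illustrative choice:

```typescript
// Sketch: exponential backoff with "full jitter".
// Caps grow 1s, 2s, 4s, 8s, ... up to maxDelayMs; the actual delay is
// a uniform random value in [0, cap), which spreads out retry storms.
function backoffWithJitter(attempt: number, baseDelayMs = 1000, maxDelayMs = 30_000): number {
  const cap = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
  return Math.random() * cap;
}
```

Swap this in for the fixed `baseDelay * Math.pow(2, attempt)` delay in the retry loop above.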

Implement Request Queuing

Queue requests and process them at a controlled rate instead of sending all at once.

Cache Responses

Cache identical requests to reduce API calls. Embeddings are particularly good candidates for caching.
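A minimal sketch of this idea, keyed on the serialized request parameters (a production cache would add TTLs, size bounds, and persistence):

```typescript
// Sketch: in-memory cache keyed on the serialized request payload.
// Identical requests skip the API entirely.
const cache = new Map<string, unknown>();

async function cachedCall<T>(params: object, fn: () => Promise<T>): Promise<T> {
  const key = JSON.stringify(params);
  if (cache.has(key)) {
    return cache.get(key) as T; // cache hit: no API call, no rate-limit cost
  }
  const result = await fn();
  cache.set(key, result);
  return result;
}
```

For embedding workloads, keying on the input text alone is often enough, since the same text always embeds to the same vector for a given model.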

Use Smaller Models

GPT-4o-mini processes faster and has higher token limits than GPT-4o. Use it when quality requirements allow.

Increasing Your Limits

Automatic Tier Upgrades

Your limits automatically increase as you spend more. Each tier unlocks higher limits.

Enterprise Plans

Need custom limits? Contact enterprise@llmhub.dev to discuss your requirements.

Next Steps