Rate Limits

Understand and work within API rate limits to build reliable applications.

Overview

Rate limits protect the API from abuse and ensure fair usage for all users. Limits are applied per API key and measured in:

  • Requests per minute (RPM) — How many API calls you can make
  • Tokens per minute (TPM) — How many tokens you can process
  • Requests per day (RPD) — Daily request quota

Default Limits

| Tier | Qualification | RPM | TPM | RPD |
|------|---------------|-----|-----|-----|
| Free | New accounts | 20 | 40,000 | 500 |
| Pro | €20+ spend | 100 | 200,000 | 5,000 |
| Business | €100+ spend | 500 | 1,000,000 | 25,000 |
| Enterprise | Contact sales | Custom | Custom | Unlimited |

Your tier automatically upgrades based on your cumulative spending. Check your current limits in the dashboard.

Rate Limit Headers

Every response includes headers to help you track your usage:

| Header | Description |
|--------|-------------|
| x-ratelimit-limit-requests | Maximum requests per minute |
| x-ratelimit-limit-tokens | Maximum tokens per minute |
| x-ratelimit-remaining-requests | Requests remaining in the current window |
| x-ratelimit-remaining-tokens | Tokens remaining in the current window |
| x-ratelimit-reset-requests | Seconds until the request limit resets |
| x-ratelimit-reset-tokens | Seconds until the token limit resets |
| retry-after | Seconds to wait before retrying (sent with 429 responses) |
TypeScript
// Check rate limit headers in the response
const response = await fetch('https://api.llmhub.dev/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer your-api-key',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: 'Hello!' }]
  })
});

// Rate limit headers
console.log('Requests remaining:', response.headers.get('x-ratelimit-remaining-requests'));
console.log('Tokens remaining:', response.headers.get('x-ratelimit-remaining-tokens'));
console.log('Requests reset in (s):', response.headers.get('x-ratelimit-reset-requests'));

Rate Limit Errors

When you exceed rate limits, the API returns a 429 Too Many Requests response:

json
{
  "error": {
    "message": "Rate limit exceeded. Please retry after 30 seconds.",
    "type": "rate_limit_error",
    "param": null,
    "code": "rate_limit_exceeded"
  }
}

Retry with Exponential Backoff

Implement automatic retries with exponential backoff for resilient applications:

TypeScript
async function callWithRetry(
  fn: () => Promise<Response>,
  maxRetries = 3,
  baseDelay = 1000
): Promise<Response> {
  let lastError: Error | null = null;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    let response: Response;
    try {
      response = await fn();
    } catch (error) {
      // Network error - retry with backoff
      lastError = error as Error;
      const delay = baseDelay * Math.pow(2, attempt);
      console.log(`Request failed. Retrying in ${delay}ms...`);
      await sleep(delay);
      continue;
    }

    // Success
    if (response.ok) {
      return response;
    }

    // Rate limited - honor retry-after if present, otherwise back off
    if (response.status === 429) {
      const retryAfter = response.headers.get('retry-after');
      const delay = retryAfter
        ? parseInt(retryAfter, 10) * 1000
        : baseDelay * Math.pow(2, attempt);

      console.log(`Rate limited. Retrying in ${delay}ms...`);
      await sleep(delay);
      continue;
    }

    // Other error - don't retry (thrown outside the try so it isn't
    // caught and mistaken for a retryable network failure)
    throw new Error(`API error: ${response.status}`);
  }

  throw lastError || new Error('Max retries exceeded');
}

const sleep = (ms: number) => new Promise(r => setTimeout(r, ms));

Request Throttling

Proactively limit your request rate to avoid hitting limits:

TypeScript
class RateLimiter {
  private queue: Array<() => void> = [];
  private running = 0;
  private maxConcurrent: number;
  private minDelay: number;
  private lastRequest = 0;

  constructor(maxConcurrent = 5, requestsPerSecond = 10) {
    this.maxConcurrent = maxConcurrent;
    this.minDelay = 1000 / requestsPerSecond;
  }

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    // Wait for a slot
    while (this.running >= this.maxConcurrent) {
      await new Promise<void>(resolve => this.queue.push(resolve));
    }

    // Ensure minimum delay between requests
    const now = Date.now();
    const timeSinceLastRequest = now - this.lastRequest;
    if (timeSinceLastRequest < this.minDelay) {
      await new Promise(r => setTimeout(r, this.minDelay - timeSinceLastRequest));
    }

    this.running++;
    this.lastRequest = Date.now();

    try {
      return await fn();
    } finally {
      this.running--;
      const next = this.queue.shift();
      if (next) next();
    }
  }
}

// Usage
const limiter = new RateLimiter(5, 10); // 5 concurrent, 10 req/sec

const results = await Promise.all(
  prompts.map(prompt => 
    limiter.execute(() => 
      client.chat.completions.create({
        model: 'gpt-4o-mini',
        messages: [{ role: 'user', content: prompt }]
      })
    )
  )
);

Batch Processing

Process large workloads efficiently with controlled batching:

TypeScript
// Instead of sending requests one by one...
// ❌ Bad: 100 separate requests
for (const item of items) {
  await processItem(item);
}

// ✅ Good: Batch requests with controlled concurrency
async function processBatch<T, R>(
  items: T[],
  processor: (item: T) => Promise<R>,
  batchSize = 10
): Promise<R[]> {
  const results: R[] = [];
  
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    const batchResults = await Promise.all(batch.map(processor));
    results.push(...batchResults);
    
    // Optional: Add delay between batches
    if (i + batchSize < items.length) {
      await new Promise(r => setTimeout(r, 100));
    }
  }
  
  return results;
}

// Process 100 items in batches of 10
const results = await processBatch(items, processItem, 10);

Best Practices

Monitor Rate Limit Headers

Track x-ratelimit-remaining-* headers and slow down before hitting limits.
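As a sketch of this practice, the helper below reads the headers documented above and pauses when the window is nearly exhausted. The threshold of 5 remaining requests and the fallback wait are illustrative choices, not values prescribed by the API:

```typescript
// Sketch: slow down before hitting the limit, using the x-ratelimit-* headers.
// The threshold (5) and fallback wait (1s) are illustrative.
async function throttleIfNearLimit(response: Response): Promise<void> {
  const remaining = Number(response.headers.get('x-ratelimit-remaining-requests'));
  const resetSeconds = Number(response.headers.get('x-ratelimit-reset-requests'));

  // If fewer than 5 requests remain in this window, wait for the reset.
  if (!Number.isNaN(remaining) && remaining < 5) {
    const waitMs = (Number.isNaN(resetSeconds) ? 1 : resetSeconds) * 1000;
    console.log(`Only ${remaining} requests left; waiting ${waitMs}ms`);
    await new Promise(r => setTimeout(r, waitMs));
  }
}
```

Call this after each response before dispatching the next request.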

Use Exponential Backoff

Start with a 1-second delay and double it on each retry (1s, 2s, 4s, 8s). Add jitter to prevent thundering herd.
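One way to sketch the jitter part: "full jitter" picks a random delay between zero and the exponential cap, so simultaneous clients don't all retry at the same instant. The 30-second ceiling is an illustrative choice:

```typescript
// Sketch: exponential backoff with "full jitter".
// Caps grow 1s, 2s, 4s, 8s, ... up to maxDelayMs; the actual delay is
// a uniform random value in [0, cap), which spreads out retry storms.
function backoffWithJitter(attempt: number, baseDelayMs = 1000, maxDelayMs = 30_000): number {
  const cap = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
  return Math.random() * cap;
}
```

Swap this in for the fixed `baseDelay * Math.pow(2, attempt)` delay in the retry loop above.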

Implement Request Queuing

Queue requests and process them at a controlled rate instead of sending all at once.

Cache Responses

Cache identical requests to reduce API calls. Embeddings are particularly good candidates for caching.
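A minimal sketch of this idea, keyed on the serialized request parameters (a production cache would add TTLs, size bounds, and persistence):

```typescript
// Sketch: in-memory cache keyed on the serialized request payload.
// Identical requests skip the API entirely.
const cache = new Map<string, unknown>();

async function cachedCall<T>(params: object, fn: () => Promise<T>): Promise<T> {
  const key = JSON.stringify(params);
  if (cache.has(key)) {
    return cache.get(key) as T; // cache hit: no API call, no rate-limit cost
  }
  const result = await fn();
  cache.set(key, result);
  return result;
}
```

For embedding workloads, keying on the input text alone is often enough, since the same text always embeds to the same vector for a given model.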

Use Smaller Models

GPT-4o-mini processes faster and has higher token limits than GPT-4o. Use it when quality requirements allow.

Increasing Your Limits

Automatic Tier Upgrades

Your limits automatically increase as you spend more. Each tier unlocks higher limits.

Enterprise Plans

Need custom limits? Contact enterprise@llmhub.dev to discuss your requirements.

Next Steps