Best Practices

Guidelines for building reliable, cost-effective, and performant AI applications.

Prompt Engineering

Be Specific and Clear

Vague prompts lead to unpredictable results. Be explicit about what you want.

❌ Vague

"Summarize this article"

✓ Specific

"Summarize this article in 3 bullet points, focusing on the key takeaways for software developers. Use plain language and avoid jargon."

Use System Messages Effectively

The system message sets context and constraints that persist throughout the conversation.

TypeScript
const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    {
      role: 'system',
      content: `You are a helpful customer support agent for TechCo.
      
Rules:
- Be friendly and professional
- If you don't know something, say so
- Never discuss competitors
- Keep responses under 100 words
- Always offer to escalate to a human if the issue is complex`
    },
    {
      role: 'user',
      content: 'My order hasn\'t arrived yet.'
    }
  ]
});

Provide Examples (Few-Shot)

Show the model what you want with examples.

TypeScript
const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    {
      role: 'system',
      content: 'Convert natural language to SQL queries.'
    },
    { role: 'user', content: 'Get all users from California' },
    { role: 'assistant', content: "SELECT * FROM users WHERE state = 'CA';" },
    { role: 'user', content: 'Find orders over $100 from last month' },
    { role: 'assistant', content: "SELECT * FROM orders WHERE amount > 100 AND created_at >= DATE_SUB(NOW(), INTERVAL 1 MONTH);" },
    { role: 'user', content: 'Count active subscriptions by plan' }
    // Model will follow the established pattern
  ]
});

Cost Optimization

Choose the Right Model

Not every task needs GPT-4o. Use gpt-4o-mini for simple tasks—it's 15-20x cheaper and often sufficient.

Limit Response Length

Set max_tokens to prevent unexpectedly long responses. Output tokens cost 2-4x more than input tokens.
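As a minimal sketch (field names follow the OpenAI Node SDK's request shape), a small helper can make the output cap explicit and reusable:

```typescript
// Build request options with an explicit output cap.
// max_tokens is a hard upper bound on billed output tokens.
function cappedRequest(userContent: string, maxTokens = 200) {
  return {
    model: 'gpt-4o-mini',
    max_tokens: maxTokens,
    messages: [{ role: 'user' as const, content: userContent }],
  };
}

// Usage: await client.chat.completions.create(cappedRequest('Summarize ...'));
```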

Trim Context

Only include relevant conversation history. Summarize or truncate older messages instead of sending the full chat log.
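One simple truncation strategy, sketched below, keeps the system message plus only the most recent turns (the `keepLast` default is an illustrative choice, not a recommendation):

```typescript
type Msg = { role: 'system' | 'user' | 'assistant'; content: string };

// Keep all system messages plus the last N conversation turns.
function trimHistory(messages: Msg[], keepLast = 6): Msg[] {
  const system = messages.filter(m => m.role === 'system');
  const rest = messages.filter(m => m.role !== 'system');
  return [...system, ...rest.slice(-keepLast)];
}
```

For longer sessions, replacing the dropped turns with a one-message summary preserves more context at a similar token cost.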

Cache Responses

Cache identical requests. For embeddings, cache vectors in a database instead of re-computing.
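A minimal in-memory sketch of request caching, keyed by a hash of the request payload (a production system would likely use Redis or another shared store instead of a `Map`):

```typescript
import { createHash } from 'node:crypto';

const cache = new Map<string, unknown>();

// Deterministic key from the full request payload.
function cacheKey(payload: unknown): string {
  return createHash('sha256').update(JSON.stringify(payload)).digest('hex');
}

// Return the cached value for identical payloads; compute and store otherwise.
async function cached<T>(payload: unknown, compute: () => Promise<T>): Promise<T> {
  const key = cacheKey(payload);
  if (cache.has(key)) return cache.get(key) as T;
  const value = await compute();
  cache.set(key, value);
  return value;
}
```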

Batch Requests

For embeddings and bulk processing, batch multiple items in a single request to reduce overhead.
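Since embedding APIs typically cap the number of inputs per request, a chunking helper keeps batches under the limit (the batch size of 100 below is illustrative; check your provider's documented limit):

```typescript
// Split items into fixed-size batches for bulk API calls.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Usage (OpenAI SDK shape):
// for (const batch of chunk(texts, 100)) {
//   const res = await client.embeddings.create({
//     model: 'text-embedding-3-small',
//     input: batch,
//   });
// }
```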

Reliability

Implement Retries

Use exponential backoff for transient failures:

TypeScript
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelay = 1000
): Promise<T> {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === maxRetries - 1) throw error;
      const delay = baseDelay * Math.pow(2, i) + Math.random() * 1000;
      await new Promise(r => setTimeout(r, delay));
    }
  }
  throw new Error('Max retries exceeded');
}

Set Timeouts

Always set reasonable timeouts for API calls. LLM responses can occasionally take longer than expected.
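Many SDKs (including OpenAI's) accept a per-request timeout option, which is preferable when available. As a generic fallback, you can race the call against a timer:

```typescript
// Reject if the wrapped promise does not settle within `ms` milliseconds.
async function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Note that this abandons the result but does not cancel the underlying request; pass an `AbortSignal` to the SDK call if you need true cancellation.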

Validate Outputs

Don't trust model outputs blindly. Validate JSON structure, check for required fields, and sanitize before use.
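A minimal validation sketch (the `Summary` shape is a hypothetical schema for illustration): parse the model's output, check required fields, and return `null` so the caller can retry or fall back rather than crash.

```typescript
interface Summary {
  title: string;
  points: string[];
}

// Returns the parsed object only if it matches the expected shape.
function parseSummary(raw: string): Summary | null {
  try {
    const data = JSON.parse(raw);
    if (typeof data.title !== 'string' || !Array.isArray(data.points)) return null;
    if (!data.points.every((p: unknown) => typeof p === 'string')) return null;
    return data as Summary;
  } catch {
    return null; // not valid JSON at all
  }
}
```

Schema-validation libraries such as Zod can replace the hand-written checks as schemas grow.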

Handle Rate Limits

Monitor rate limit headers and implement request queuing for high-volume applications.
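A minimal concurrency limiter is one building block for such a queue (a real implementation should also honor `Retry-After` and the provider's rate-limit response headers):

```typescript
// Run `fn` over `items` with at most `limit` calls in flight at once,
// preserving result order.
async function mapLimited<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  const workers = Array.from({ length: Math.min(limit, items.length) }, worker);
  await Promise.all(workers);
  return results;
}
```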

Security

Protect API Keys

  • Never expose API keys in client-side code
  • Use environment variables
  • Rotate keys regularly
  • Use separate keys for dev/staging/production


Sanitize User Input

Prevent prompt injection by validating and sanitizing user inputs. Consider using structured formats instead of free text.
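One common mitigation is to wrap untrusted input in clear delimiters and strip any delimiter-like sequences the user might inject. This is a sketch that reduces, but does not eliminate, prompt-injection risk; the `<user_input>` tag is an illustrative convention, not a standard:

```typescript
// Strip injected delimiter tags, then wrap the input so the prompt can
// instruct the model to treat everything inside the tags as data.
function wrapUserInput(input: string): string {
  const cleaned = input.replace(/<\/?user_input>/gi, '');
  return `<user_input>\n${cleaned}\n</user_input>`;
}
```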

Don't Log Sensitive Data

Avoid logging full prompts or responses that may contain PII or sensitive information.

Validate Webhook Signatures

Always verify HMAC signatures on incoming webhooks to prevent spoofing.
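A sketch of HMAC-SHA256 verification using Node's crypto module; header names, encodings, and signing schemes vary by provider, so check your provider's webhook documentation:

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Verify a hex-encoded HMAC-SHA256 signature over the raw request body.
// timingSafeEqual prevents timing attacks on the comparison.
function verifySignature(rawBody: string, signature: string, secret: string): boolean {
  const expected = createHmac('sha256', secret).update(rawBody).digest('hex');
  const a = Buffer.from(expected);
  const b = Buffer.from(signature);
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Verify against the raw body bytes, not a re-serialized JSON object, since serialization differences will change the digest.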

Performance

Use Streaming

Enable streaming for chat UIs to show responses as they are generated. This dramatically improves perceived latency.
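A sketch of consuming a stream (the chunk shape below assumes the OpenAI SDK's chat-completions streaming format), invoking a callback per token so the UI can render incrementally:

```typescript
type StreamChunk = { choices: { delta: { content?: string } }[] };

// Accumulate the full response while surfacing each token as it arrives.
async function collectStream(
  stream: AsyncIterable<StreamChunk>,
  onToken: (token: string) => void
): Promise<string> {
  let full = '';
  for await (const chunk of stream) {
    const token = chunk.choices[0]?.delta?.content ?? '';
    if (token) {
      full += token;
      onToken(token);
    }
  }
  return full;
}

// Usage:
// const stream = await client.chat.completions.create({ model: 'gpt-4o', messages, stream: true });
// const text = await collectStream(stream, t => process.stdout.write(t));
```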

Parallelize Independent Calls

When processing multiple items, run them concurrently:

TypeScript
// Process 10 items in parallel
const results = await Promise.all(
  items.map(item =>
    client.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: [{ role: 'user', content: item }]
    })
  )
);

Use Smaller Models for Routing

Use a fast model to classify or route requests, then use a powerful model only when needed.
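A routing sketch along these lines (the SIMPLE/COMPLEX categories and the triage prompt are illustrative choices, not an established pattern from any API):

```typescript
// Minimal client shape needed for the sketch (matches the OpenAI SDK surface used).
type ChatClient = {
  chat: {
    completions: {
      create: (opts: object) => Promise<{ choices: { message: { content: string | null } }[] }>;
    };
  };
};

// Classify with the cheap model, then pick the model for the real call.
async function routeRequest(client: ChatClient, userMessage: string): Promise<string> {
  const triage = await client.chat.completions.create({
    model: 'gpt-4o-mini',
    max_tokens: 5,
    messages: [
      { role: 'system', content: 'Reply with exactly one word: SIMPLE or COMPLEX.' },
      { role: 'user', content: userMessage },
    ],
  });
  const label = triage.choices[0].message.content?.trim().toUpperCase();
  return label === 'COMPLEX' ? 'gpt-4o' : 'gpt-4o-mini';
}
```

Defaulting to the cheaper model on an unrecognized label keeps misclassifications cheap; flip the default if quality matters more than cost.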

Testing & Evaluation

Create Test Suites

Build a dataset of test cases with expected outputs. Run regression tests when changing prompts.

Use LLM-as-Judge

For subjective quality, use a separate model to evaluate outputs against criteria.

Monitor in Production

Track latency, token usage, error rates, and user feedback. Log enough to debug issues without storing sensitive data.

A/B Test Prompts

When iterating on prompts, run A/B tests to measure impact on key metrics.
