Vision
Analyze images using AI vision models. Extract text, describe scenes, compare images, and more.
Supported Models
| Model | Provider | Max Images | Notes |
|---|---|---|---|
| gpt-4o | OpenAI | 20 | Recommended |
| gpt-4o-mini | OpenAI | 20 | Faster, cheaper |
| claude-3.5-sonnet | Anthropic | 20 | Excellent at details |
| gemini-2.0-flash | Google | 16 | Fast, good for video frames |
| llama-3.2-90b-vision | Meta | 10 | Open source |
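The image limits in the table can be checked client-side before a request is sent. The following is a minimal sketch: `MAX_IMAGES`, `countImages`, and `checkImageCount` are hypothetical names, the limits are copied from the table rather than queried from the API, and the limit is assumed to apply per request.

```javascript
// Limits copied from the Supported Models table (assumed to apply per request).
const MAX_IMAGES = {
  'gpt-4o': 20,
  'gpt-4o-mini': 20,
  'claude-3.5-sonnet': 20,
  'gemini-2.0-flash': 16,
  'llama-3.2-90b-vision': 10,
};

// Hypothetical helper: count image parts across all messages in a request.
function countImages(messages) {
  let count = 0;
  for (const message of messages) {
    // Plain-string content carries no image parts.
    if (!Array.isArray(message.content)) continue;
    count += message.content.filter((part) => part.type === 'image_url').length;
  }
  return count;
}

// Hypothetical helper: throw before sending if the request exceeds the model's limit.
function checkImageCount(model, messages) {
  const limit = MAX_IMAGES[model];
  if (limit === undefined) throw new Error(`Unknown vision model: ${model}`);
  const count = countImages(messages);
  if (count > limit) {
    throw new Error(`${model} accepts at most ${limit} images, got ${count}`);
  }
  return count;
}
```

Calling `checkImageCount(model, messages)` just before `client.chat.completions.create` turns an oversized request into a local error instead of a failed API call.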
Basic Usage
Pass images in the message content array alongside text:
// `client` is assumed to be an already-initialized OpenAI-compatible SDK client.
const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'What is in this image?' },
        {
          type: 'image_url',
          image_url: {
            url: 'https://example.com/image.jpg'
          }
        }
      ]
    }
  ]
});
console.log(response.choices[0].message.content);
Base64 Images
Send images as base64-encoded data URLs for local files or generated images:
import fs from 'fs';
// Read image and convert to base64
const imageBuffer = fs.readFileSync('path/to/image.png');
const base64Image = imageBuffer.toString('base64');
const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Describe this image in detail.' },
        {
          type: 'image_url',
          image_url: {
            url: `data:image/png;base64,${base64Image}`
          }
        }
      ]
    }
  ]
});
Supported formats: JPEG, PNG, GIF (first frame only), WebP
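The MIME type in the data URL should match the actual file format. The following is a sketch of one way to pick it from the file extension; `MIME_TYPES` and `toDataUrl` are hypothetical names, and the mapping covers only the formats listed above.

```javascript
// Hypothetical mapping from file extension to MIME type for the supported formats.
const MIME_TYPES = {
  '.jpg': 'image/jpeg',
  '.jpeg': 'image/jpeg',
  '.png': 'image/png',
  '.gif': 'image/gif',
  '.webp': 'image/webp',
};

// Hypothetical helper: build a base64 data URL from raw bytes and a file name.
function toDataUrl(bytes, fileName) {
  const dot = fileName.lastIndexOf('.');
  const ext = dot === -1 ? '' : fileName.slice(dot).toLowerCase();
  const mime = MIME_TYPES[ext];
  if (!mime) throw new Error(`Unsupported image format: ${fileName}`);
  // Buffer is a Node.js global; Buffer.from accepts a Buffer or a Uint8Array.
  return `data:${mime};base64,${Buffer.from(bytes).toString('base64')}`;
}
```

For example, `toDataUrl(fs.readFileSync('photo.webp'), 'photo.webp')` yields a string that can be used as the `image_url.url` value.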
Multiple Images
Analyze multiple images in a single request for comparison or context:
const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Compare these two images. What are the differences?' },
        {
          type: 'image_url',
          image_url: { url: 'https://example.com/image1.jpg' }
        },
        {
          type: 'image_url',
          image_url: { url: 'https://example.com/image2.jpg' }
        }
      ]
    }
  ]
});
Detail Level
Control image analysis quality with the detail parameter:
{
  type: 'image_url',
  image_url: {
    url: 'https://example.com/image.jpg',
    detail: 'high' // 'low', 'high', or 'auto'
  }
}
// 'low' - 512x512 fixed, faster and cheaper
// 'high' - Detailed analysis, uses more tokens
// 'auto' - Model decides based on image size (default)
| Detail | Tokens | Best For |
|---|---|---|
| low | ~85 tokens | Quick classification, thumbnails, simple scenes |
| high | ~765+ tokens | OCR, detailed analysis, small text, fine details |
| auto | Varies | Let the model choose based on image size |
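When several images share one prompt and one detail level, the content array can be assembled by a small helper. This is a sketch only: `buildVisionMessage` and `DETAIL_LEVELS` are hypothetical names, and the message shape simply mirrors the examples in this page.

```javascript
const DETAIL_LEVELS = ['low', 'high', 'auto'];

// Hypothetical helper: build one user message with a text prompt and any number of images.
function buildVisionMessage(prompt, imageUrls, detail = 'auto') {
  if (!DETAIL_LEVELS.includes(detail)) {
    throw new Error(`detail must be one of: ${DETAIL_LEVELS.join(', ')}`);
  }
  return {
    role: 'user',
    content: [
      { type: 'text', text: prompt },
      // Every image gets the same detail setting.
      ...imageUrls.map((url) => ({ type: 'image_url', image_url: { url, detail } })),
    ],
  };
}
```

The returned object can be placed directly in the `messages` array, for example `messages: [buildVisionMessage('Compare these thumbnails.', urls, 'low')]`.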
Common Use Cases
OCR / Text Extraction
Extract text from documents, screenshots, or photos:
const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: 'Extract all text from this image. Return it as plain text.'
        },
        {
          type: 'image_url',
          image_url: { url: 'https://example.com/document.png' }
        }
      ]
    }
  ]
});
Chart & Data Analysis
Analyze charts, graphs, and data visualizations:
const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: `Analyze this chart and provide:
1. The type of chart
2. Main trends or patterns
3. Key data points
4. Any insights or conclusions`
        },
        {
          type: 'image_url',
          image_url: { url: 'https://example.com/chart.png' }
        }
      ]
    }
  ]
});
Image Size & Costs
Size Limits
- Maximum file size: 20 MB per image
- Maximum dimensions: 2048 × 2048 pixels (images are resized)
- Minimum dimensions: 10 × 10 pixels
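These limits can be checked locally before uploading. The sketch below assumes the caller already knows the file size and pixel dimensions (for example, from an image library); `checkImageLimits` is a hypothetical helper, and 1 MB is assumed to mean 1024 × 1024 bytes.

```javascript
const MAX_FILE_BYTES = 20 * 1024 * 1024; // 20 MB, assuming 1 MB = 1024 * 1024 bytes
const MAX_SIDE = 2048;
const MIN_SIDE = 10;

// Hypothetical helper: errors mean the image is outside the limits;
// warnings mean it is accepted but will be resized.
function checkImageLimits({ bytes, width, height }) {
  const errors = [];
  const warnings = [];
  if (bytes > MAX_FILE_BYTES) errors.push('file is larger than 20 MB');
  if (width < MIN_SIDE || height < MIN_SIDE) {
    errors.push('image is smaller than 10 x 10 pixels');
  }
  if (width > MAX_SIDE || height > MAX_SIDE) {
    warnings.push('image exceeds 2048 x 2048 pixels and will be resized');
  }
  return { ok: errors.length === 0, errors, warnings };
}
```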
Token Calculation
Images are converted to tokens based on their size and detail level. High-detail images are split into 512×512 tiles, each costing ~170 tokens, on top of a base cost of ~85 tokens (the low-detail figure). A 1024×1024 high-detail image is four tiles, so it uses approximately 4 × 170 + 85 = 765 tokens.
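The tile arithmetic can be sketched as a rough estimator. This is an approximation, not the billing formula: `estimateImageTokens` is a hypothetical helper, the fixed 85-token base is an assumption taken from the low-detail figure (it makes the 1024×1024 example come out to 765), and any resizing applied before tiling is ignored.

```javascript
const BASE_TOKENS = 85;  // assumed fixed cost, taken from the ~85-token low-detail figure
const TILE_TOKENS = 170; // approximate cost per 512x512 tile
const TILE_SIZE = 512;

// Hypothetical estimator. 'auto' is treated like 'high' to give an upper estimate.
function estimateImageTokens(width, height, detail = 'high') {
  if (detail === 'low') return BASE_TOKENS;
  const tiles = Math.ceil(width / TILE_SIZE) * Math.ceil(height / TILE_SIZE);
  return BASE_TOKENS + tiles * TILE_TOKENS;
}

console.log(estimateImageTokens(1024, 1024, 'high')); // 765
console.log(estimateImageTokens(1024, 1024, 'low'));  // 85
```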
Best Practices
Optimize Image Size
Resize large images before sending to reduce costs. For most tasks, 1024×1024 is sufficient quality.
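The target size for a resize can be computed without distorting the image. The sketch below only does the arithmetic; `fitWithin` is a hypothetical helper, and the actual pixel resampling is left to whatever image library is in use.

```javascript
// Hypothetical helper: scale dimensions down so the longer side is at most maxSide,
// preserving aspect ratio. Images that already fit are returned unchanged.
function fitWithin(width, height, maxSide = 1024) {
  const scale = Math.min(1, maxSide / Math.max(width, height));
  return {
    width: Math.max(1, Math.round(width * scale)),
    height: Math.max(1, Math.round(height * scale)),
  };
}

console.log(fitWithin(4000, 3000)); // { width: 1024, height: 768 }
console.log(fitWithin(800, 600));   // { width: 800, height: 600 }
```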
Use Low Detail When Possible
For simple tasks like classification or general description, use detail: "low" to save tokens.
Be Specific in Prompts
Tell the model exactly what to look for. "What text is in the top-right corner?" is better than "What's in this image?"
Cache URL Images
When using image URLs, ensure they're stable and fast to load. Consider using a CDN for frequently analyzed images.
Limitations
- Cannot identify specific people (for privacy)
- May struggle with very small text or low-contrast images
- Animated GIFs: only the first frame is analyzed
- Cannot process videos directly (extract frames first)
- May misinterpret highly stylized or abstract images

