Comparative Analysis of Large Language Models in Vision Tasks
Examining GPT-4 Turbo, Claude-3 Family, and Gemini for Performance and Value.
Introduction
This documentation provides an analysis of various Large Language Models (LLMs) with a focus on their performance in vision-related tasks. The models under consideration include GPT-4 Turbo, Claude-3 Haiku, Claude-3 Opus, Claude-3 Sonnet, Gemini-1.0 Pro, and Gemini-1.5 Pro. The analysis covers aspects such as pricing, overall quality, and performance in specific vision and reasoning tasks.
TL;DR
Budget-conscious start: Try Claude-3 Haiku first; it offers the best quality score for the lowest price.
Quality boost: If text generation quality is crucial for your project, consider upgrading to GPT-4 Turbo. It has a solid quality score for text generation, though it comes at a higher cost.
Other options: The Claude-3 models (Opus and Sonnet) and the Gemini-1.5 Pro Vision might be suitable depending on your specific use case and quality vs. cost trade-offs.
Model Overview
The models evaluated in this analysis are:
GPT-4 Turbo: A high-performance model known for its text generation quality, with solid vision capabilities.
Claude-3 Haiku: Optimized for fast and cost-effective performance, making it ideal for most generic enterprise tasks.
Claude-3 Opus: Offers the highest quality among the models, suitable for tasks requiring deep understanding.
Claude-3 Sonnet: Provides the best performance/price ratio, making it a versatile choice.
Gemini-1.0 Pro: A model designed for vision tasks, with a focus on affordability.
Gemini-1.5 Pro: An updated version of Gemini-1.0 Pro, offering improved quality.
Pricing and Quality
Price reflects the cost of processing one full-resolution image, along with 100 tokens for input (prompt) and 100 tokens for output.
Average quality is the mean of each model's scores on the MMMU, DocVQA, MathVista, AI2D, and ChartQA benchmarks.
| Model | Price (USD) | Average quality |
|---|---|---|
| GPT-4 Turbo | 0.01505 | 70.36% |
| Claude-3 Haiku | 0.00055 | 70.76% |
| Claude-3 Opus | 0.033 | 73.62% |
| Claude-3 Sonnet | 0.0066 | 72.06% |
| Gemini-1.0 Pro Vision | 0.0027 | 65.84% |
| Gemini-1.5 Pro Vision | Unknown | 71.74% |
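Assuming the prices above are USD per request (one full-resolution image plus 100 input and 100 output tokens), the quality-per-dollar ranking behind the TL;DR recommendations can be reproduced with a short script. Gemini-1.5 Pro Vision is excluded because its price is unknown.

```python
# Price (USD per request) and average quality score (%), taken from the table above.
models = {
    "GPT-4 Turbo": (0.01505, 70.36),
    "Claude-3 Haiku": (0.00055, 70.76),
    "Claude-3 Opus": (0.033, 73.62),
    "Claude-3 Sonnet": (0.0066, 72.06),
    "Gemini-1.0 Pro Vision": (0.0027, 65.84),
    # Gemini-1.5 Pro Vision omitted: price unknown.
}

# Quality points per dollar spent, sorted best-first.
value_ranking = sorted(
    ((name, quality / price) for name, (price, quality) in models.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, ratio in value_ranking:
    print(f"{name:22s} {ratio:12,.0f} quality pts / $")
```

Running this puts Claude-3 Haiku far ahead of the rest on value, while Claude-3 Opus, the highest-quality model, ranks last, matching the trade-off described in the TL;DR.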
Note about resolution:
GPT-4 Turbo: Images are first scaled to fit within a 2048 x 2048 square while maintaining their aspect ratio. They are then resized so that the shortest side of the image is 768 pixels long.
Claude-3: Can handle slightly larger images, for example, 1092 x 1092 pixels for a 1:1 aspect ratio and 819 x 1456 pixels for a 9:16 aspect ratio.
Gemini: No information is available on its image resolution capabilities.
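The two-step downscaling rule described for GPT-4 Turbo can be sketched as follows. This is a simplified model of the documented behavior for estimating effective input resolution, not OpenAI's actual implementation:

```python
def gpt4_turbo_resize(width: int, height: int) -> tuple[int, int]:
    """Approximate the image downscaling described for GPT-4 Turbo."""
    # Step 1: scale to fit within a 2048 x 2048 square, preserving the
    # aspect ratio (images already inside the square are left untouched).
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Step 2: shrink so the shortest side is 768 pixels, if it is longer.
    if min(width, height) > 768:
        scale = 768 / min(width, height)
        width, height = width * scale, height * scale
    return round(width), round(height)

print(gpt4_turbo_resize(4096, 2048))  # → (1536, 768): both steps apply
print(gpt4_turbo_resize(640, 480))    # → (640, 480): already small enough
```

The practical takeaway: anything you send above roughly 768 pixels on the short side is downscaled before the model sees it, so very high-resolution uploads buy no extra detail.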