Comparative Analysis of Large Language Models in Vision Tasks

Examining GPT-4 Turbo, Claude-3 Family, and Gemini for Performance and Value.

Introduction

This documentation provides an analysis of various Large Language Models (LLMs) with a focus on their performance in vision-related tasks. The models under consideration include GPT-4 Turbo, Claude-3 Haiku, Claude-3 Opus, Claude-3 Sonnet, Gemini-1.0 Pro, and Gemini-1.5 Pro. The analysis covers aspects such as pricing, overall quality, and performance in specific vision and reasoning tasks.

TL;DR

  • Budget-conscious start: Try Claude-3 Haiku first; it offers a competitive quality score at by far the lowest price.

  • Quality boost: If text generation quality is crucial to your project, consider upgrading to GPT-4 Turbo. It has a solid quality score for text generation, though it comes at a higher cost.

  • Other options: The Claude-3 models (Opus and Sonnet) and Gemini-1.5 Pro Vision may also be suitable, depending on your specific use case and quality-versus-cost trade-offs.

Model Overview

The models evaluated in this analysis are:

  • GPT-4 Turbo: A high-performance model known for its text generation quality, with capable vision performance.

  • Claude-3 Haiku: Optimized for fast and cost-effective performance, making it ideal for most generic enterprise tasks.

  • Claude-3 Opus: Offers the highest quality among the models, suitable for tasks requiring deep understanding.

  • Claude-3 Sonnet: Provides the best performance/price ratio, making it a versatile choice.

  • Gemini-1.0 Pro: A model designed for vision tasks, with a focus on affordability.

  • Gemini-1.5 Pro: An updated version of Gemini-1.0 Pro, offering improved quality.

Pricing and Quality

Price reflects the cost of processing one full-resolution image, along with 100 tokens for input (prompt) and 100 tokens for output.

Average quality is determined by the combined scores of MMMU, DocVQA, MathVista, AI2D, and ChartQA.
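
To make those two definitions concrete, here is a minimal sketch of the arithmetic behind the table. The per-image and per-token rates, the function names, and the benchmark numbers are all placeholders of our own; they do not reproduce any provider's actual price list or results.

```python
# Minimal sketch of the two calculations behind the table.
# All rates and scores below are placeholders, not real provider pricing
# or real benchmark results.

def request_cost(image_price: float,
                 input_token_price: float,
                 output_token_price: float,
                 input_tokens: int = 100,
                 output_tokens: int = 100) -> float:
    """Cost of one request: one full-resolution image plus 100 input and 100 output tokens."""
    return image_price + input_tokens * input_token_price + output_tokens * output_token_price


def average_quality(scores: dict) -> float:
    """Unweighted average of the benchmark scores (MMMU, DocVQA, MathVista, AI2D, ChartQA)."""
    return sum(scores.values()) / len(scores)


# Placeholder example, not actual data:
print(request_cost(image_price=0.004,
                   input_token_price=0.00001,
                   output_token_price=0.00003))          # 0.008
print(average_quality({"MMMU": 58.0, "DocVQA": 88.0, "MathVista": 50.0,
                       "AI2D": 78.0, "ChartQA": 80.0}))  # 70.8
```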

Model                 | Price   | Average quality
GPT-4 Turbo           | 0.01505 | 70.36%
Claude-3 Haiku        | 0.00055 | 70.76%
Claude-3 Opus         | 0.033   | 73.62%
Claude-3 Sonnet       | 0.0066  | 72.06%
Gemini-1.0 Pro Vision | 0.0027  | 65.84%
Gemini-1.5 Pro Vision | Unknown | 71.74%

  • Note about resolution:

    • GPT-4 Turbo: Images are first scaled to fit within a 2048 x 2048 square while maintaining their aspect ratio, then resized so that the shortest side of the image is 768 pixels long (a sketch of this logic follows this list).

    • Claude-3: Can handle slightly larger images, for example, 1092 x 1092 pixels for a 1:1 aspect ratio and 819 x 1456 pixels for a 9:16 aspect ratio.

    • Gemini: No information is available on its image resolution capabilities.
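
To illustrate the GPT-4 Turbo scaling rule above, here is a minimal sketch of the two-step resize. The function name is ours, and the logic follows the description above rather than OpenAI's actual implementation; we assume each step only ever downscales.

```python
# Sketch of GPT-4 Turbo's described two-step downscaling (assumed: downscale only).

def gpt4_turbo_target_size(width: int, height: int) -> tuple:
    """Return the approximate size an image is reduced to before processing."""
    # Step 1: scale to fit within a 2048 x 2048 square, preserving aspect ratio.
    if max(width, height) > 2048:
        scale = 2048 / max(width, height)
        width, height = round(width * scale), round(height * scale)

    # Step 2: scale so the shortest side is 768 pixels (if it is still larger).
    if min(width, height) > 768:
        scale = 768 / min(width, height)
        width, height = round(width * scale), round(height * scale)

    return width, height

# Example: a 4032 x 3024 photo comes out at roughly 1024 x 768.
print(gpt4_turbo_target_size(4032, 3024))
```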
