# Comparative Analysis of Large Language Models in Vision Tasks

### Introduction <a href="#introduction" id="introduction"></a>

This documentation provides an analysis of various Large Language Models (LLMs) with a focus on their performance in vision-related tasks. The models under consideration include GPT-4 Turbo, Claude-3 Haiku, Claude-3 Opus, Claude-3 Sonnet, Gemini-1.0 Pro, and Gemini-1.5 Pro. The analysis covers aspects such as pricing, overall quality, and performance in specific vision and reasoning tasks.

### TL;DR

* **Budget-conscious start:** Try Claude-3 Haiku first; it offers the best quality score for the lowest price.
* **Quality boost:** If text generation quality is crucial for your project, consider upgrading to GPT-4 Turbo. It has a solid quality score for text generation, though it comes at a higher cost.
* **Other options:** The Claude-3 models (Opus and Sonnet) and the Gemini-1.5 Pro Vision might be suitable depending on your specific use case and quality vs. cost trade-offs.

<figure><img src="/files/yJO0KzJlSPddnRrUArD1" alt=""><figcaption><p>This scatter plot compares the quality and price of several vision LLMs. Average quality is determined by the combined scores of MMMU, DocVQA, MathVista, AI2D, and ChartQA. Price reflects the cost of processing one full-resolution image, along with 100 tokens for input (prompt) and 100 tokens for output.</p></figcaption></figure>

<figure><img src="/files/zZ7gKse5rDa35KBHEa9R" alt="" width="375"><figcaption><p>This radar chart compares the performance of vision-based language models across five diverse tasks: math &#x26; reasoning (MMMU), document QA (DocVQA), MathVista, science diagrams (AI2D), and ChartQA.</p></figcaption></figure>

**Model Overview**

The models evaluated in this analysis are:

* **GPT-4 Turbo**: A high-performance model known for its text generation quality, with capable vision performance.
* **Claude-3 Haiku**: Optimized for fast and cost-effective performance, making it ideal for most generic enterprise tasks.
* **Claude-3 Opus**: Offers the highest quality among the models, suitable for tasks requiring deep understanding.
* **Claude-3 Sonnet**: Provides the best performance/price ratio, making it a versatile choice.
* **Gemini-1.0 Pro**: A model designed for vision tasks, with a focus on affordability.
* **Gemini-1.5 Pro**: An updated version of Gemini-1.0 Pro, offering improved quality.

### Pricing and Quality <a href="#pricing-and-quality" id="pricing-and-quality"></a>

Price reflects the cost of processing one full-resolution image, along with 100 tokens for input (prompt) and 100 tokens for output.

Average quality is determined by the combined scores of MMMU, DocVQA, MathVista, AI2D, and ChartQA.

| Model                 | Price (USD) | Average quality |
| --------------------- | ----------- | --------------- |
| GPT-4 Turbo           | 0.01505     | 70.36%          |
| Claude-3 Haiku        | 0.00055     | 70.76%          |
| Claude-3 Opus         | 0.033       | 73.62%          |
| Claude-3 Sonnet       | 0.0066      | 72.06%          |
| Gemini-1.0 Pro Vision | 0.0027      | 65.84%          |
| Gemini-1.5 Pro Vision | Unknown     | 71.74%          |
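To compare cost-effectiveness, the table above can be reduced to a quality-per-dollar ranking. The sketch below uses the table's prices and quality scores (Gemini-1.5 Pro Vision is omitted because its price is unknown); the `quality_per_dollar` helper is illustrative, not part of any API:

```python
# Price (USD per request) and average quality (%) taken from the table above.
MODELS = {
    "GPT-4 Turbo": (0.01505, 70.36),
    "Claude-3 Haiku": (0.00055, 70.76),
    "Claude-3 Opus": (0.033, 73.62),
    "Claude-3 Sonnet": (0.0066, 72.06),
    "Gemini-1.0 Pro Vision": (0.0027, 65.84),
}

def quality_per_dollar(models):
    # Rank models by average quality points per dollar spent, best first.
    return sorted(models, key=lambda m: models[m][1] / models[m][0], reverse=True)

ranking = quality_per_dollar(MODELS)
print(ranking[0])  # Claude-3 Haiku: by far the most quality per dollar.
```

This confirms the TL;DR: Claude-3 Haiku dominates on cost-effectiveness, while Claude-3 Opus trades the worst price ratio for the highest absolute quality.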

* **Note about resolution:**
  * **GPT-4 Turbo:** Images are first scaled to fit within a 2048 x 2048 square while maintaining their aspect ratio. They are then resized so that the shortest side of the image is 768 pixels long.
  * **Claude-3:** Can handle slightly larger images, for example, 1092 x 1092 pixels for a 1:1 aspect ratio and 819 x 1456 pixels for a 9:16 aspect ratio.
  * **Gemini:** No information is available on its image resolution capabilities.
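The GPT-4 Turbo scaling rule above can be expressed as a small function. This is a sketch of the two-step rule as described (fit within a 2048 x 2048 square, then shrink so the shortest side is 768 pixels); the rounding behavior is an assumption, and OpenAI's exact implementation may differ:

```python
def gpt4_turbo_resize(width: int, height: int) -> tuple[int, int]:
    # Step 1: scale down to fit within a 2048 x 2048 square, keeping aspect ratio.
    if max(width, height) > 2048:
        scale = 2048 / max(width, height)
        width, height = round(width * scale), round(height * scale)
    # Step 2: scale down so the shortest side is 768 pixels long.
    if min(width, height) > 768:
        scale = 768 / min(width, height)
        width, height = round(width * scale), round(height * scale)
    return width, height

print(gpt4_turbo_resize(4096, 2048))  # (1536, 768)
```

For a 4096 x 2048 image, step one halves it to 2048 x 1024 and step two shrinks it to 1536 x 768; smaller images pass through unchanged.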


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.jamaibase.com/using-the-platform/model-providers/comparative-analysis-of-large-language-models-in-vision-tasks.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
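As a sketch, the query above can be issued with Python's standard library. The `build_ask_url` helper name is hypothetical; the live request is left commented out because it requires network access:

```python
from urllib.parse import quote
from urllib.request import urlopen

PAGE_URL = ("https://docs.jamaibase.com/using-the-platform/model-providers/"
            "comparative-analysis-of-large-language-models-in-vision-tasks.md")

def build_ask_url(question: str) -> str:
    # URL-encode the natural-language question into the `ask` query parameter.
    return f"{PAGE_URL}?ask={quote(question)}"

# Example (performs a live HTTP GET; requires network access):
# with urlopen(build_ask_url("What image resolution does Gemini support?")) as resp:
#     print(resp.read().decode())
```

Keep the question specific and self-contained, as noted above, since the encoded string is the only context the endpoint receives.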
