Comparative Analysis of Large Language Models in Vision Tasks

Examining GPT-4 Turbo, Claude-3 Family, and Gemini for Performance and Value.


Last updated 1 year ago


Introduction

This documentation provides an analysis of various Large Language Models (LLMs) with a focus on their performance in vision-related tasks. The models under consideration include GPT-4 Turbo, Claude-3 Haiku, Claude-3 Opus, Claude-3 Sonnet, Gemini-1.0 Pro, and Gemini-1.5 Pro. The analysis covers aspects such as pricing, overall quality, and performance in specific vision and reasoning tasks.

TL;DR

  • Budget-conscious start: Try Claude-3 Haiku first; it offers the best quality score for the lowest price.

  • Quality boost: If text-generation quality is crucial for your project, consider upgrading to GPT-4 Turbo. It delivers a solid quality score for text generation, though at a higher cost.

  • Other options: The other Claude-3 models (Opus and Sonnet) and Gemini-1.5 Pro may be suitable depending on your specific use case and your quality-vs-cost trade-off.

Model Overview

The models evaluated in this analysis are:

  • GPT-4 Turbo: A high-performance model known for its text-generation quality, with solid vision capabilities as well.

  • Claude-3 Haiku: Optimized for fast and cost-effective performance, making it ideal for most generic enterprise tasks.

  • Claude-3 Opus: Offers the highest quality among the models, suitable for tasks requiring deep understanding.

  • Claude-3 Sonnet: Provides the best performance/price ratio, making it a versatile choice.

  • Gemini-1.0 Pro: A model designed for vision tasks, with a focus on affordability.

  • Gemini-1.5 Pro: An updated version of Gemini-1.0 Pro, offering improved quality.

Pricing and Quality

Price reflects the cost of processing one full-resolution image, along with 100 tokens for input (prompt) and 100 tokens for output.

Average quality is determined by the combined scores of MMMU, DocVQA, MathVista, AI2D, and ChartQA.
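As an illustrative sketch, the average-quality column is simply the plain mean of the five benchmark scores. The per-benchmark numbers below are hypothetical placeholders, not the actual results behind the table:

```python
# Sketch: how the "average quality" column is derived — an unweighted mean
# of the five vision/reasoning benchmark scores.
# The scores below are hypothetical placeholders, not real benchmark results.
benchmarks = {
    "MMMU": 50.2,
    "DocVQA": 88.7,
    "MathVista": 46.4,
    "AI2D": 86.7,
    "ChartQA": 81.7,
}

average_quality = sum(benchmarks.values()) / len(benchmarks)
print(f"Average quality: {average_quality:.2f}%")
```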

| Model | Price (USD) | Average quality |
| --- | --- | --- |
| GPT-4 Turbo (Vision) | 0.01505 | 70.36% |
| Claude-3 Haiku | 0.00055 | 70.76% |
| Claude-3 Opus | 0.033 | 73.62% |
| Claude-3 Sonnet | 0.0066 | 72.06% |
| Gemini-1.0 Pro Vision | 0.0027 | 65.84% |
| Gemini-1.5 Pro Vision | Unknown | 71.74% |

  • Note about resolution:

    • GPT-4 Turbo: Images are first scaled to fit within a 2048 x 2048 square while maintaining their aspect ratio. They are then resized so that the shortest side of the image is 768 pixels long.

    • Claude-3: Can handle slightly larger images, for example, 1092 x 1092 pixels for a 1:1 aspect ratio and 819 x 1456 pixels for a 9:16 aspect ratio.

    • Gemini: No information is available on its image resolution capabilities.
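The GPT-4 Turbo resize rule above can be sketched in code, combined with OpenAI's published high-detail token formula (a base of 85 tokens plus 170 tokens per 512 x 512 tile). Treat the function name and the example dimensions as illustrative, and check current OpenAI documentation for up-to-date constants:

```python
import math

def gpt4_turbo_image_tokens(width: int, height: int) -> int:
    """Estimate image tokens for GPT-4 Turbo in high-detail mode.

    Follows the resize rule described above: fit the image within a
    2048 x 2048 square, then scale so the shortest side is 768 px;
    the token cost is 85 (base) + 170 per 512 x 512 tile.
    """
    # Step 1: fit within a 2048 x 2048 square, keeping the aspect ratio.
    if max(width, height) > 2048:
        scale = 2048 / max(width, height)
        width, height = int(width * scale), int(height * scale)
    # Step 2: scale down so the shortest side is 768 px.
    if min(width, height) > 768:
        scale = 768 / min(width, height)
        width, height = int(width * scale), int(height * scale)
    # Step 3: count 512 x 512 tiles and apply the token formula.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

print(gpt4_turbo_image_tokens(1024, 1024))
```

For example, a 1024 x 1024 image is scaled to 768 x 768, which covers four tiles and therefore costs 85 + 4 x 170 = 765 tokens.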

This scatter plot compares the quality and price of several vision LLMs. Average quality is determined by the combined scores of MMMU, DocVQA, MathVista, AI2D, and ChartQA. Price reflects the cost of processing one full-resolution image, along with 100 tokens for input (prompt) and 100 tokens for output.
This radar chart compares the performance of vision-based language models across five diverse tasks: math & reasoning (MMMU), DocVQA, MathVista, science diagrams (AI2D), and ChartQA.