
Multimodal AI in Practice: Adding Vision to Your TypeScript App

The multimodal AI market hit $3.85B in 2026. Google's Gemini Embedding 2 maps text, images, and audio into one vector space. Here's how to add vision to your app with practical code.

AI · TypeScript · Multimodal AI · Computer Vision

The first time I sent a screenshot to GPT-4o and asked "What is wrong with this UI?", I was genuinely surprised by the response. It identified a misaligned button, a contrast issue in the sidebar, and a truncated label — all from a PNG file. That moment made something click: AI is not just a text tool anymore. It sees.

The multimodal AI market hit $3.85 billion in 2026 and is growing at 29% annually. Google just released Gemini Embedding 2, the first natively multimodal embedding model that maps text, images, video, audio, and documents into a single embedding space. The convergence is real. If you build applications with TypeScript, adding vision capabilities is now a practical, production-ready option.

What Multimodal Actually Means for Developers

Traditional AI workflows processed each modality separately: one model for text, another for images, a third for audio. Combining their outputs required complex pipelines that lost context at the boundaries between modalities.

Modern multimodal models — GPT-4o, Claude, Gemini — process images and text through unified architectures that reason about the relationship between what they see and read simultaneously. You send an image and a text prompt in a single API call, and the model understands both in context.

For TypeScript developers, this means you can:

  • Analyze screenshots and mockups for UI issues
  • Extract structured data from documents, invoices, and forms
  • Process images alongside text in RAG pipelines
  • Build visual Q&A features without a separate computer vision stack

The Simplest Starting Point

Adding vision to an existing TypeScript application is surprisingly straightforward. Here is a minimal example using the OpenAI SDK:

import OpenAI from 'openai';

const openai = new OpenAI();

async function analyzeImage(
  imageUrl: string,
  question: string
): Promise<string> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: question },
          { type: 'image_url', image_url: { url: imageUrl } },
        ],
      },
    ],
    max_tokens: 1000,
  });

  return response.choices[0].message.content ?? '';
}

// Usage
const analysis = await analyzeImage(
  'https://example.com/dashboard-screenshot.png',
  'Identify any UI issues, accessibility problems, or layout bugs in this dashboard.'
);

That is it. No computer vision library, no image preprocessing pipeline, no separate ML model. One API call that understands both the image and the question about it.
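The example above takes a public URL, but in most apps the image lives on disk or in memory. The API also accepts data URLs in the `image_url` field, so a small helper covers that case. This is a sketch; `toDataUrl` is my name for it, not part of any SDK:

```typescript
import { readFile } from 'node:fs/promises';

// Build a data URL from raw image bytes; the chat API accepts data URLs
// in the image_url field just like https URLs.
function toDataUrl(bytes: Buffer, mimeType: string): string {
  return `data:${mimeType};base64,${bytes.toString('base64')}`;
}

// Usage with the analyzeImage function from the example above:
//   const bytes = await readFile('./dashboard.png');
//   const analysis = await analyzeImage(
//     toDataUrl(bytes, 'image/png'),
//     'Identify any UI issues in this dashboard.'
//   );
```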

Practical Use Cases I Have Built

Document Data Extraction

At Lit Alerts, some client data arrives as scanned PDFs and photos of physical documents. Instead of building an OCR pipeline with Tesseract and custom parsing logic, I send the images directly to GPT-4o with a structured output schema.

import { z } from 'zod';

const InvoiceData = z.object({
  vendor: z.string(),
  invoiceNumber: z.string(),
  date: z.string(),
  lineItems: z.array(z.object({
    description: z.string(),
    quantity: z.number(),
    unitPrice: z.number(),
    total: z.number(),
  })),
  totalAmount: z.number(),
  currency: z.string(),
});

async function extractInvoiceData(imageBase64: string) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    response_format: { type: 'json_object' },
    messages: [
      {
        role: 'system',
        // Note: JSON.stringify on a Zod shape does not produce a readable
        // schema, so spell the expected fields out in the prompt instead.
        content:
          'Extract invoice data from this image. Return a JSON object with exactly these keys: ' +
          'vendor (string), invoiceNumber (string), date (string), ' +
          'lineItems (array of { description: string, quantity: number, unitPrice: number, total: number }), ' +
          'totalAmount (number), currency (string).',
      },
      {
        role: 'user',
        content: [
          { type: 'image_url', image_url: { url: `data:image/jpeg;base64,${imageBase64}` } },
        ],
      },
    ],
  });

  const raw = JSON.parse(response.choices[0].message.content ?? '{}');
  return InvoiceData.parse(raw);
}

This handles handwritten invoices, photos taken at angles, and multi-language documents — things that traditional OCR pipelines struggle with. The accuracy is not perfect, but it is good enough for a human-in-the-loop review workflow where the AI does the initial extraction and a person verifies.

Visual QA for Support

For a product support tool, I built a feature where users upload a screenshot of their problem and the AI diagnoses it. The model sees the error message, the UI state, and the context — all in one image.

async function diagnoseFromScreenshot(
  screenshot: string,
  userDescription: string
): Promise<Diagnosis> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `You are a technical support specialist. Analyze the screenshot and the user's description to diagnose the issue. Provide a clear diagnosis and step-by-step resolution.`,
      },
      {
        role: 'user',
        content: [
          { type: 'text', text: `User says: "${userDescription}"` },
          { type: 'image_url', image_url: { url: `data:image/png;base64,${screenshot}` } },
        ],
      },
    ],
  });

  return parseDiagnosis(response.choices[0].message.content ?? '');
}
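The `Diagnosis` type and `parseDiagnosis` helper above are app-specific. A minimal sketch of what they might look like, assuming the system prompt asks the model to answer with a "Diagnosis:" line followed by numbered steps:

```typescript
interface Diagnosis {
  summary: string;
  steps: string[];
}

// Parse a free-text reply of the form:
//   Diagnosis: <one-line summary>
//   Steps:
//   1. <step>
//   2. <step>
// This assumes the prompt requests that structure; adjust to your format.
function parseDiagnosis(text: string): Diagnosis {
  const summary =
    text.match(/Diagnosis:\s*(.+)/i)?.[1]?.trim() ?? text.split('\n')[0].trim();
  const steps = [...text.matchAll(/^\s*\d+\.\s*(.+)$/gm)].map((m) => m[1].trim());
  return { summary, steps };
}
```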

Multimodal Embeddings: The Next Frontier

Google's Gemini Embedding 2 is a game-changer for RAG pipelines. Until now, if you wanted to search across images and text, you needed separate embedding models for each modality and complex fusion logic to combine the results.

Gemini Embedding 2 maps text, images, video, audio, and documents into a single vector space. This means you can:

  • Embed a product catalog (images + descriptions) and search it with text queries
  • Build a knowledge base of screenshots and documentation that responds to natural language questions
  • Index video frames alongside transcripts for unified search

The implications for RAG pipelines are significant. Instead of text-only retrieval, you can retrieve the most semantically relevant content regardless of whether it is a paragraph, a diagram, a screenshot, or a video clip.
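Whatever model produces the vectors, the retrieval side is modality-agnostic: once text, images, and video frames share one space, a single similarity search ranks them all together. A hand-rolled sketch (in production you would use a vector database, and the `IndexedItem` shape is mine):

```typescript
interface IndexedItem {
  id: string;
  modality: 'text' | 'image' | 'video';
  vector: number[]; // produced by a multimodal embedding model
}

// Standard cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank every indexed item against a query vector, regardless of modality.
function search(query: number[], index: IndexedItem[], k = 5): IndexedItem[] {
  return [...index]
    .sort((x, y) => cosineSimilarity(query, y.vector) - cosineSimilarity(query, x.vector))
    .slice(0, k);
}
```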

Cost and Latency Considerations

Multimodal API calls are more expensive than text-only calls. An image input adds roughly 765–1,105 tokens per image (depending on size and detail level) to your token consumption. At GPT-4o pricing, that works out to approximately $0.003–$0.01 per image.

For high-volume image processing, this adds up. If you process 10,000 images per day, you are looking at $30–$100/day in additional API costs. For document extraction workflows, consider:

  • Batch processing. Group images and process them during off-peak hours.
  • Caching. If the same image is submitted multiple times (common in support workflows), cache the extraction result.
  • Selective processing. Not every image needs the most expensive model. Use a cheaper model for initial triage and route only complex images to GPT-4o.
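The caching idea is simple to sketch: key the cache by a hash of the image bytes, so byte-identical re-uploads skip the API call entirely. Here it is with an in-memory Map (swap in Redis or similar for production; the function names are mine):

```typescript
import { createHash } from 'node:crypto';

const extractionCache = new Map<string, unknown>();

// Content-addressed key: identical bytes always hash to the same key.
function imageKey(bytes: Buffer): string {
  return createHash('sha256').update(bytes).digest('hex');
}

// Run the (expensive) extractor only on cache misses.
async function extractWithCache<T>(
  bytes: Buffer,
  extract: (bytes: Buffer) => Promise<T>
): Promise<T> {
  const key = imageKey(bytes);
  if (extractionCache.has(key)) return extractionCache.get(key) as T;
  const result = await extract(bytes);
  extractionCache.set(key, result);
  return result;
}
```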

Latency is also higher. A multimodal API call typically takes 3–8 seconds, compared to 1–3 seconds for text-only. For real-time applications, consider processing images asynchronously and displaying results progressively.
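For the asynchronous, progressive pattern, a concurrency-limited mapper is usually enough: keep at most N requests in flight and surface each result as it finishes so the UI can update incrementally. A sketch (the helper name and callback are mine):

```typescript
// Map an async function over items with at most `limit` in flight,
// invoking onResult as each item finishes so results display progressively.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
  onResult?: (item: T, result: R) => void
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const i = next++; // safe: the event loop is single-threaded
      results[i] = await fn(items[i]);
      onResult?.(items[i], results[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```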

What Does Not Work Yet

Multimodal AI has real limitations:

  • Fine-grained spatial reasoning. The model can tell you there is a button in the corner, but it cannot precisely locate it by pixel coordinates.
  • Small text in images. Text smaller than 12px in screenshots is often missed or misread.
  • Complex charts and graphs. Simple bar charts work well. Multi-axis scatter plots with dense labels are unreliable.
  • Real-time video. Processing individual frames works, but real-time video analysis at 30fps is neither practical nor affordable via API.

For these cases, traditional computer vision tools — OpenCV, Tesseract, or specialized models — are still the better choice. Multimodal LLMs are best at understanding the meaning of an image, not performing pixel-level analysis.

The vision capabilities in modern LLMs are not a replacement for the entire computer vision field. They are a remarkably accessible entry point for developers who need "good enough" image understanding without building a dedicated ML pipeline. And for most application-level use cases — document extraction, visual Q&A, content moderation, UI analysis — "good enough" is more than enough.

