I had a client who needed AI-powered text classification in their application but refused to send any data to an external API. They were in healthcare, and their compliance team vetoed every cloud-based AI solution. Patient data could not leave the device. Period.
For months, this seemed like a dead end. You cannot run GPT-4 in a browser. But you can run a 30-million-parameter model that classifies text with 94% accuracy — entirely on the user's device, with no API calls, no data transmission, and no monthly inference bill. That is the promise of on-device AI, and in 2026, it is finally practical.
Why On-Device Matters
The typical AI feature makes an API call. The user's data travels from their browser to your server, from your server to an LLM provider, and back. This round trip has three costs:
Latency. Even with edge proxying, a typical LLM API call takes 1–5 seconds. On-device inference for small models takes 10–100 milliseconds.
Privacy. The data leaves the user's device. For healthcare, finance, legal, and government applications, this is often a non-starter. Even with encryption and data processing agreements, the compliance burden is significant.
Cost. Every API call costs money. At scale, those costs add up. On-device inference costs nothing after the initial model download.
The trade-off is model capability. You are not running Claude or GPT-4 in the browser. You are running small, specialized models that do one thing well: classification, entity extraction, summarization of short texts, or semantic similarity matching.
The Technology Stack in 2026
Three technologies make on-device AI practical today:
WebGPU
WebGPU is the successor to WebGL for GPU-accelerated computation in the browser. Unlike WebGL, which was designed for graphics, WebGPU exposes general-purpose GPU compute through compute shaders. This makes it suitable for running neural network inference.
As of early 2026, WebGPU is supported in Chrome, Edge, and Firefox. Safari support is partial but improving. For applications that need broad browser support, you fall back to WebAssembly (WASM) on browsers that do not support WebGPU — slower, but functional.
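The fallback decision can be a one-line feature check. A minimal sketch, assuming nothing beyond the standard `navigator.gpu` presence test (the navigator-like argument is injected so the helper can be exercised outside a browser):

```typescript
// Pick the fastest available backend: WebGPU if the browser exposes
// `navigator.gpu`, otherwise fall back to WASM.
type Backend = 'webgpu' | 'wasm';

function pickBackend(
  nav: { gpu?: unknown } = (globalThis as any).navigator ?? {}
): Backend {
  return nav.gpu ? 'webgpu' : 'wasm';
}

// Usage with Transformers.js:
// const classifier = await pipeline('text-classification', modelId, {
//   device: pickBackend(),
// });
```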
ONNX Runtime Web
ONNX Runtime Web is Microsoft's inference engine that runs ONNX-format models in the browser. It supports both WebGPU and WASM backends, automatically selecting the fastest available option.
The key advantage of ONNX Runtime is the model ecosystem. Most popular model architectures — BERT, DistilBERT, T5-small, MobileNet — can be exported to ONNX format and run directly in the browser.
Transformers.js
Hugging Face's Transformers.js library brings the familiar Hugging Face API to JavaScript. It wraps ONNX Runtime and provides a high-level interface for running models.
import { pipeline } from '@huggingface/transformers';

// Load a text classification model — downloads once, runs locally
const classifier = await pipeline(
  'text-classification',
  'Xenova/distilbert-base-uncased-finetuned-sst-2-english',
  { device: 'webgpu' }
);

const result = await classifier('This product is amazing');
// [{ label: 'POSITIVE', score: 0.9998 }]
The first call downloads the model (typically 30–100MB for small models). Subsequent calls run entirely on-device with no network requests.
Practical Use Cases
Not every AI task belongs on-device. Here are the patterns where I have found it most valuable:
Real-Time Text Classification
For the healthcare client, we deployed a fine-tuned DistilBERT model that classifies patient intake form responses into urgency categories. The model runs on every keystroke with debouncing, providing instant feedback as the clinician types.
import { pipeline } from '@huggingface/transformers';

let classifier: any = null;

async function getClassifier() {
  if (!classifier) {
    classifier = await pipeline(
      'text-classification',
      'our-org/medical-urgency-classifier',
      { device: 'webgpu' }
    );
  }
  return classifier;
}

async function classifyUrgency(text: string): Promise<{
  level: 'routine' | 'urgent' | 'emergency';
  confidence: number;
}> {
  const model = await getClassifier();
  const results = await model(text);
  return {
    level: results[0].label.toLowerCase() as 'routine' | 'urgent' | 'emergency',
    confidence: results[0].score,
  };
}
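The per-keystroke debouncing mentioned above is a plain timer; a minimal sketch:

```typescript
// Debounce: delay calling `fn` until `waitMs` of input silence, so the
// classifier runs once per typing pause rather than on every keystroke.
function debounce<T extends unknown[]>(
  fn: (...args: T) => void,
  waitMs: number
): (...args: T) => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: T) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
}

// const onInput = debounce((text: string) => {
//   void classifyUrgency(text).then(renderResult);
// }, 300);
```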
Inference takes 15–30 milliseconds on a modern laptop. No API call. No data leaving the device. The compliance team signed off immediately.
Semantic Search in the Browser
For applications with small-to-medium knowledge bases (under 10,000 documents), you can run the entire search pipeline on-device. Embed the query locally, compare against pre-computed document embeddings shipped with the application, and return results without a server round trip.
import { pipeline } from '@huggingface/transformers';

const embedder = await pipeline(
  'feature-extraction',
  'Xenova/all-MiniLM-L6-v2',
  { device: 'webgpu' }
);

async function searchDocuments(
  query: string,
  documents: { text: string; embedding: number[] }[]
): Promise<{ text: string; similarity: number }[]> {
  const queryEmbedding = await embedder(query, {
    pooling: 'mean',
    normalize: true,
  });
  const queryVector = Array.from(queryEmbedding.data);

  return documents
    .map((doc) => ({
      text: doc.text,
      similarity: cosineSimilarity(queryVector, doc.embedding),
    }))
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, 5);
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
This pattern works well for documentation search, product catalogs, and FAQ systems where the corpus fits in the browser's memory.
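A quick way to sanity-check "fits in memory": Float32 embeddings cost documents × dimensions × 4 bytes.

```typescript
// Rough RAM cost of holding a corpus of Float32 embeddings in memory.
function embeddingMemoryMB(numDocs: number, dims: number): number {
  return (numDocs * dims * 4) / (1024 * 1024);
}

// all-MiniLM-L6-v2 produces 384-dimensional vectors, so a 10,000-document
// corpus needs roughly 15 MB for the vectors alone — comfortably in budget.
```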
Smart Form Validation
Beyond simple regex validation, on-device models can provide intelligent form validation: detecting typos in names, validating that free-text descriptions are relevant to the form context, or auto-categorizing entries.
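A sketch of how regex and model-based checks can compose. The `scoreRelevance` function here is a hypothetical stand-in for an on-device classifier; it is injected as a parameter so the validation logic itself stays testable:

```typescript
interface ValidationResult {
  valid: boolean;
  reason?: string;
}

// Run cheap structural checks first, then a model-based relevance score.
async function validateDescription(
  text: string,
  scoreRelevance: (t: string) => Promise<number>
): Promise<ValidationResult> {
  if (text.trim().length < 10) {
    return { valid: false, reason: 'Description too short' };
  }
  const relevance = await scoreRelevance(text);
  if (relevance < 0.5) {
    return { valid: false, reason: 'Description does not match the form context' };
  }
  return { valid: true };
}
```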
Performance Considerations
On-device inference has real constraints:
Model download size. A DistilBERT model is ~65MB. That is a significant initial download. Mitigate with lazy loading — only download the model when the user first interacts with the AI feature, and cache it in IndexedDB for subsequent visits.
// Cache model files in IndexedDB via the library's built-in caching
const model = await pipeline('text-classification', 'model-name', {
  cache_dir: 'indexeddb://models',
  device: 'webgpu',
});
Memory usage. A small model uses 100–200MB of GPU memory. On mobile devices with limited GPU memory, this can cause issues. Always check device capabilities before loading a model, and provide a server-side fallback for low-end devices.
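One way to gate model loading on device capability. `navigator.deviceMemory` (the Device Memory API, which reports approximate RAM in GB) is Chromium-only, so a missing value is treated as unknown rather than low; the navigator-like object is injected for testability, and the 4 GB threshold is an illustrative assumption:

```typescript
// Decide whether to load the model locally or route to a server fallback.
function canRunLocally(nav: { deviceMemory?: number; gpu?: unknown }): boolean {
  const memoryOk = nav.deviceMemory === undefined || nav.deviceMemory >= 4;
  return Boolean(nav.gpu) && memoryOk;
}

// if (canRunLocally((globalThis as any).navigator ?? {})) { loadModel(); }
// else { useServerEndpoint(); }
```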
First-inference latency. The first inference after loading a model is slower than subsequent ones because the GPU needs to compile the compute shaders. This "warm-up" can take 1–3 seconds. Pre-warm the model during page load if possible.
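Pre-warming can be as simple as one throwaway inference scheduled during idle time; `requestIdleCallback` is not universal, so this sketch falls back to a timer:

```typescript
// Run a dummy inference during browser idle time so shader compilation
// finishes before the user's first real request.
function schedulePrewarm(runInference: () => Promise<unknown>): void {
  const idle =
    (globalThis as any).requestIdleCallback ??
    ((cb: () => void) => setTimeout(cb, 0));
  idle(() => {
    void runInference().catch(() => {
      // A failed warm-up is harmless; the first real call pays the cost.
    });
  });
}

// schedulePrewarm(() => classifier('warm-up text'));
```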
Browser support. WebGPU is not universal. As of March 2026, about 75% of desktop browsers support it. Mobile support is spottier. Always implement a WASM fallback.
The Hybrid Pattern
The most practical architecture is hybrid: use on-device models for tasks that benefit from speed and privacy, and fall back to cloud APIs for tasks that require large model capabilities.
async function processText(text: string): Promise<ProcessedResult> {
  // Step 1: On-device classification (fast, private)
  const category = await localClassifier.classify(text);

  // Step 2: Only call the cloud API if the task requires it
  if (category.needsDetailedAnalysis) {
    return await cloudAPI.analyze(text);
  }

  // For simple categories, on-device is sufficient
  return { category: category.label, confidence: category.score };
}
In the healthcare application, 70% of classifications are handled entirely on-device. Only the ambiguous cases (confidence below 0.85) get escalated to a cloud model for a second opinion. This reduces API costs by 70% while maintaining accuracy.
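The confidence-threshold escalation can be sketched with the local and cloud calls injected (both stand-ins here, not real endpoints):

```typescript
interface Classification {
  label: string;
  score: number;
}

// Escalate to the cloud only when local confidence falls below the threshold.
async function classifyWithEscalation(
  text: string,
  local: (t: string) => Promise<Classification>,
  cloud: (t: string) => Promise<Classification>,
  threshold = 0.85
): Promise<Classification & { source: 'local' | 'cloud' }> {
  const localResult = await local(text);
  if (localResult.score >= threshold) {
    return { ...localResult, source: 'local' };
  }
  return { ...(await cloud(text)), source: 'cloud' };
}
```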
When to Go On-Device
On-device AI makes sense when:
- Privacy is a hard requirement. Healthcare, finance, legal, government.
- Latency matters more than capability. Real-time feedback, keystroke-by-keystroke analysis.
- The task is narrow. Classification, extraction, similarity — not open-ended generation.
- Offline support is needed. Field workers, mobile apps in low-connectivity environments.
- Volume makes API costs prohibitive. If you process thousands of items per user session, on-device eliminates per-call costs entirely.
On-device AI does not make sense when:
- You need frontier model capabilities. Complex reasoning, long-form generation, multi-step analysis.
- The model needs frequent updates. Retraining and redeploying browser-cached models is harder than updating a server-side endpoint.
- Your users are on old hardware. WebGPU needs a decent GPU, and WASM fallback is 5–10x slower.
The browser is not going to replace the cloud for AI. But for the right use cases, it eliminates entire categories of complexity — no API keys, no server costs, no privacy concerns. The models are small, the inference is fast, and the user's data never leaves their machine. That is a powerful combination.