
Running LLMs at the Edge: AI on Cloudflare Workers

How to deploy AI-powered features to Cloudflare Workers — from proxying LLM API calls at the edge to running inference directly on Cloudflare's network with Workers AI, with real latency benchmarks.

Cloudflare Workers · AI · Edge Computing · LLM

The latency problem is the silent killer of modern web applications. When I’m building features that rely on LLMs, I’m constantly reminded of the physical distance between my users and the compute resources. If a user in Tokyo triggers an AI-powered feature on my site, and my backend is calling an OpenAI endpoint in us-east-1, that round-trip time is brutal. It’s not just the inference time; it’s the network overhead, the TLS handshake, and the sheer distance the data has to travel. This is where edge computing, specifically using LLMs with Cloudflare Workers, changes the game.

What "AI at the Edge" Actually Means

When we talk about "edge AI," it’s easy to conflate two very different architectural patterns. It’s important to distinguish between them because they solve different problems and have vastly different trade-offs.

First, there is orchestrating AI calls FROM the edge. In this pattern, your Cloudflare Worker acts as a high-performance proxy. It receives the user's request, handles authentication, potentially performs some lightweight prompt engineering or RAG (Retrieval-Augmented Generation) lookups, and then forwards the request to a centralized LLM API like OpenAI or Anthropic. The benefit here is low-latency request handling and global distribution of your application logic, even if the final inference still happens in a centralized data center.
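The lightweight RAG lookup mentioned above can be sketched in a few lines. The prompt template and the `buildPrompt` helper below are illustrative choices, and the commented Vectorize query assumes an index bound as `env.VECTORIZE` — adjust the binding name and metadata shape to your own setup:

```typescript
// Hypothetical helper: fold retrieved snippets into a single prompt string.
// The template itself is an illustrative choice, not a fixed API.
export function buildPrompt(snippets: string[], question: string): string {
  const context = snippets.map((s, i) => `[${i + 1}] ${s}`).join("\n");
  return `Use the following context to answer.\n\n${context}\n\nQuestion: ${question}`;
}

// Inside a Worker fetch handler (sketch; assumes a Vectorize index bound as
// env.VECTORIZE and an embedding already computed for the question):
// const matches = await env.VECTORIZE.query(questionEmbedding, { topK: 3, returnMetadata: true });
// const snippets = matches.matches.map((m) => String(m.metadata?.text ?? ""));
// const prompt = buildPrompt(snippets, question);
```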

Second, there is running inference ON the edge. This is what Cloudflare’s Workers AI platform enables. Instead of proxying to a third-party API, you are running the model inference directly on Cloudflare’s global GPU fleet. This is true edge AI. The inference happens in a data center physically closer to the user, drastically reducing latency and keeping the data within the Cloudflare network.

Calling LLM APIs from Workers

The proxy pattern is the most common starting point. It’s straightforward to implement and allows you to leverage the best models available today. The key is to handle the streaming response correctly to ensure the user gets immediate feedback.

Here is a practical example of how I handle streaming requests to an LLM API from a Cloudflare Worker:

export default {
  async fetch(request, env, ctx): Promise<Response> {
    const { prompt } = await request.json<{ prompt: string }>();

    const response = await fetch("https://api.anthropic.com/v1/messages", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "x-api-key": env.ANTHROPIC_API_KEY,
        "anthropic-version": "2023-06-01",
      },
      body: JSON.stringify({
        model: "claude-3-5-sonnet-20240620",
        max_tokens: 1024,
        stream: true,
        messages: [{ role: "user", content: prompt }],
      }),
    });

    // Surface upstream errors instead of streaming an error body as SSE
    if (!response.ok) {
      return new Response(await response.text(), { status: response.status });
    }

    // Stream the response back to the client
    return new Response(response.body, {
      headers: { "Content-Type": "text/event-stream" },
    });
  },
} satisfies ExportedHandler<Env>;

This pattern is incredibly powerful for building responsive AI chat interfaces. By streaming the response, you hide the latency of the LLM inference itself, making the application feel instantaneous.
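On the client side, that streamed body can be rendered incrementally as it arrives. A minimal sketch, assuming Anthropic-style SSE where each `data:` line carries a JSON event with an optional `delta.text` field (the `/api/chat` path and `outputElement` are placeholders):

```typescript
// Extract the incremental text from one SSE "data:" line, or null if none.
export function extractDelta(line: string): string | null {
  if (!line.startsWith("data:")) return null;
  const payload = line.slice(5).trim();
  if (!payload || payload === "[DONE]") return null;
  try {
    const event = JSON.parse(payload);
    // Anthropic streaming events carry incremental text in delta.text.
    return typeof event?.delta?.text === "string" ? event.delta.text : null;
  } catch {
    return null;
  }
}

// Browser-side consumption sketch:
// const res = await fetch("/api/chat", { method: "POST", body: JSON.stringify({ prompt }) });
// const reader = res.body!.pipeThrough(new TextDecoderStream()).getReader();
// for (let buf = ""; ;) {
//   const { value, done } = await reader.read();
//   if (done) break;
//   buf += value;
//   const lines = buf.split("\n");
//   buf = lines.pop()!; // keep any partial line for the next chunk
//   for (const line of lines) {
//     const delta = extractDelta(line);
//     if (delta) outputElement.textContent += delta;
//   }
// }
```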

Using Cloudflare Workers AI

When you need to run inference directly on the edge, Cloudflare Workers AI is the tool of choice. It abstracts away the complexity of managing GPU infrastructure. You simply bind the AI service to your Worker and start making calls.

export default {
  async fetch(request, env, ctx) {
    const { prompt } = await request.json<{ prompt: string }>();

    // The AI binding (configured in wrangler.toml) exposes run() directly;
    // the older @cloudflare/ai npm package is deprecated.
    const stream = await env.AI.run('@cf/meta/llama-3-8b-instruct', {
      prompt,
      stream: true,
    });

    return new Response(stream, {
      headers: { "Content-Type": "text/event-stream" },
    });
  },
} satisfies ExportedHandler<Env>;

The model catalog is growing rapidly, covering everything from text generation to image classification and embeddings. The trade-off here is model selection. You are limited to the models Cloudflare supports, which might not always be the absolute latest state-of-the-art model available via API. However, for many use cases, the latency benefits of running inference on the edge far outweigh the need for the absolute largest model.
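Embeddings are a good example of where a small edge-hosted model shines. Below is a sketch pairing the `@cf/baai/bge-base-en-v1.5` model from the catalog with a plain cosine-similarity helper; the commented Worker portion assumes the standard `env.AI` binding and a `data` array of vectors in the response, but verify both against the current Workers AI docs:

```typescript
// Plain cosine similarity between two equal-length vectors.
export function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Worker-side sketch: embed two texts and compare them.
// export default {
//   async fetch(request, env): Promise<Response> {
//     const { data } = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
//       text: ["edge computing", "cloud computing"],
//     });
//     const score = cosineSimilarity(data[0], data[1]);
//     return Response.json({ score });
//   },
// } satisfies ExportedHandler<Env>;
```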

"The true power of edge AI isn't just about speed; it's about the ability to build intelligent applications that were previously impossible due to latency constraints."

The Runtime Constraints

Working with Cloudflare Workers means operating within a specific set of constraints. You don't have a traditional server environment. There is no persistent file system, and you have strict CPU time and memory limits.

These constraints are particularly relevant for AI workloads. If you are doing heavy prompt engineering or complex RAG, you need to be mindful of your execution time. Streaming is not just a UX feature; it’s a necessity to stay within the Worker's execution limits. For long-running agents or complex multi-step workflows, you should look into offloading the heavy lifting to Durable Objects, which provide persistent state and longer execution times.
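One defensive pattern under these limits is to cap how long the Worker will wait on any upstream call, so a slow LLM API can't eat your entire execution budget. A minimal, generic sketch (the 10-second budget is illustrative, not one of Cloudflare's actual quotas):

```typescript
// Reject if the wrapped promise takes longer than ms milliseconds.
export function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${ms}ms`)),
      ms,
    );
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Usage inside a fetch handler (sketch):
// const upstream = await withTimeout(
//   fetch("https://api.anthropic.com/v1/messages", init),
//   10_000, // illustrative budget; tune to your plan's limits
// );
```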

Performance Results

In my own testing on this site, moving from a centralized backend to a Cloudflare Worker proxy reduced my Time to First Byte (TTFB) for AI-powered features by over 400ms for users in Asia. When using Workers AI for smaller, specialized models, the latency is even lower, often under 100ms for the initial response.

"Edge computing for LLMs is the difference between an application that feels like a prototype and one that feels like a production-grade product."

By strategically choosing between proxying to powerful centralized APIs and running inference directly on the edge with Workers AI, you can build AI applications that are both intelligent and incredibly fast, regardless of where your users are located.
