
The Latency Tax: Why AI Features Feel Slow and How to Fix It

Users abandoned our AI tool — not because the output was bad, but because it took 5 seconds. Streaming, skeleton states, predictive loading, and the 500ms rule that changed everything.

AI · Performance · UX · React

The first time I shipped an AI feature, I thought the hard part was getting the model to produce good output. I was wrong. The hard part was getting users to wait for it.

Our AI-powered data cleaning tool at Lit Alerts produced excellent results. But users abandoned it. The average LLM call took 3–5 seconds, and in a world where Google trained everyone to expect results in 200 milliseconds, five seconds felt like an eternity. We had a product problem disguised as a performance problem.

Here is what I learned about making AI features feel fast — even when they are not.

The Perception Problem

The actual speed of your AI feature matters less than the perceived speed. Research from the Nielsen Norman Group has shown this for decades: users tolerate longer waits when they can see progress. A 10-second task with a progress indicator feels faster than a 5-second task with a blank screen.

LLM calls are uniquely painful because they combine two of the worst UX antipatterns: unpredictable duration and zero visual feedback. The user clicks a button, and then... nothing. For anywhere from 1 to 30 seconds. They do not know if the app is broken, if the request failed, or if they should try again.

This is the latency tax. It is not about making the model faster. It is about making the wait tolerable.

Stream Everything

The single most impactful thing you can do is stream your LLM responses. Instead of waiting for the entire response to generate and then showing it all at once, display tokens as they arrive.

async function streamResponse(
  prompt: string,
  onToken: (token: string) => void
): Promise<void> {
  const response = await fetch('/api/ai/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
  });

  const reader = response.body?.getReader();
  if (!reader) return;

  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true handles multi-byte characters split across chunks
    onToken(decoder.decode(value, { stream: true }));
  }
}

Streaming transforms a 5-second wait into a 5-second reading experience. The user sees the first token in under 500 milliseconds, and their brain shifts from "waiting" to "reading." That cognitive shift is everything.

I covered the server-side implementation of streaming in a previous post. The point here is about the UX: streaming is not just a technical optimization. It is a product decision that fundamentally changes how your AI feature feels.
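One client-side detail worth sketching: calling `setState` on every single token can thrash React with dozens of re-renders per second. A small buffer that accumulates tokens and flushes at most once per interval keeps the stream feeling live while re-rendering only a few times a second. This is a minimal sketch of that idea; the interval value is an assumption you should tune for your UI.

```typescript
// Batches incoming tokens and flushes them at most once per interval,
// so the UI re-renders a few times per second instead of once per token.
class TokenBuffer {
  private pending = '';
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private onFlush: (text: string) => void,
    private intervalMs = 50 // assumed flush cadence; tune per UI
  ) {}

  push(token: string): void {
    this.pending += token;
    if (this.timer === null) {
      this.timer = setTimeout(() => this.flush(), this.intervalMs);
    }
  }

  flush(): void {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.pending) {
      this.onFlush(this.pending);
      this.pending = '';
    }
  }
}
```

In a React component, `onFlush` would typically be `(text) => setOutput((prev) => prev + text)`, with a final `flush()` when the stream completes.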

Optimistic UI for AI Actions

Not every AI interaction is a chat. Sometimes the AI is classifying data, generating a summary, or filling in form fields. For these discrete actions, streaming does not apply. Instead, use optimistic UI patterns.

The idea is simple: show the expected result immediately, then confirm or correct it when the AI responds.

import { useEffect, useState } from 'react';

function useAIClassification(text: string) {
  const [result, setResult] = useState<Classification | null>(null);
  const [isConfirmed, setIsConfirmed] = useState(false);

  useEffect(() => {
    let cancelled = false;

    // Show a placeholder immediately
    setResult({ category: 'analyzing...', confidence: 0 });
    setIsConfirmed(false);

    classifyText(text).then((classification) => {
      // Ignore stale responses if the input changed mid-flight
      if (cancelled) return;
      setResult(classification);
      setIsConfirmed(true);
    });

    return () => {
      cancelled = true;
    };
  }, [text]);

  return { result, isConfirmed };
}

In the Lit Alerts dashboard, we show a skeleton of the cleaned data immediately, with a subtle shimmer animation on the fields the AI is still processing. The user can see the structure of what is coming, which reduces anxiety and keeps them engaged.

Predictive Loading

If you can anticipate what the user will do next, you can start the AI call before they ask for it.

On the Complai compliance search, when a user opens a document, I pre-fetch AI summaries for the three most commonly asked questions about that document type. By the time the user clicks "Summarize," the result is already cached. It feels instant.

function prefetchSummaries(documentId: string, documentType: string) {
  const commonQuestions = getCommonQuestions(documentType);

  commonQuestions.forEach((question) => {
    queryClient.prefetchQuery({
      queryKey: ['ai-summary', documentId, question],
      queryFn: () => generateSummary(documentId, question),
      staleTime: 1000 * 60 * 30, // Cache for 30 minutes
    });
  });
}

TanStack Query makes this trivial. The prefetchQuery call fires in the background, populates the cache, and when the user eventually triggers the query, it resolves immediately from cache.

Skeleton States That Communicate

A loading spinner says "wait." A skeleton state says "here is what is coming." The difference in user patience is measurable.

For AI features, the skeleton should match the expected output structure. If the AI is generating a three-column table, show a three-column skeleton. If it is generating a paragraph of text, show text-shaped placeholders with a shimmer effect.

The key is specificity. A generic spinner gives the user no information about what is happening. A skeleton that matches the output shape tells them: "The system is working, and this is what the result will look like."

The best loading state is one that the user barely notices. If your skeleton matches the final output closely enough, the transition from loading to loaded becomes a subtle reveal rather than a jarring replacement.
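One way to keep skeletons specific without hand-building one per feature is to derive placeholder shapes from the same schema that describes the AI output. Here is a hedged sketch of that idea; the `Schema` type and the width class names (Tailwind-style utilities) are assumptions, not a fixed API.

```typescript
// Derive a placeholder row from an output schema, so the skeleton
// always matches the shape of the data the AI will return.
type FieldKind = 'text' | 'number' | 'badge';
type Schema = Record<string, FieldKind>;

function skeletonRow(schema: Schema): Record<string, string> {
  // Shimmer-bar widths per field kind (assumed utility classes)
  const widths: Record<FieldKind, string> = {
    text: 'w-48',   // wide bar for prose fields
    number: 'w-12', // narrow bar for numeric fields
    badge: 'w-16',  // pill-sized bar for labels
  };
  return Object.fromEntries(
    Object.entries(schema).map(([field, kind]) => [field, widths[kind]])
  );
}
```

Because the skeleton and the real output share one schema, they cannot drift apart as the feature evolves — the reveal stays subtle by construction.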

Chunked Processing for Batch Operations

When an AI feature processes multiple items — classifying a list of emails, generating descriptions for a product catalog — the worst thing you can do is wait for all of them to finish before showing any results.

Instead, process items in small batches and render each batch as it completes.

async function processBatch<T>(
  items: T[],
  processor: (item: T) => Promise<T>,
  onBatchComplete: (processed: T[]) => void,
  batchSize = 5
): Promise<void> {
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    const results = await Promise.all(batch.map(processor));
    onBatchComplete(results);
  }
}

In the Lit Alerts data pipeline, we process records in batches of 10. The user sees the first batch of cleaned data in 2–3 seconds, and the rest fills in progressively. The total processing time is the same, but the perceived time drops dramatically because the user sees meaningful output almost immediately.
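Pairing the batch loop with a progress counter gives the user both partial results and an honest completion percentage. A minimal sketch of the bookkeeping, independent of any UI framework:

```typescript
// Tracks batch progress so the UI can show partial results alongside
// an overall completion percentage.
function batchProgress(total: number, batchSize: number) {
  let done = 0;
  return {
    // Total number of batches the loop will run
    batches: Math.ceil(total / batchSize),
    // Call after each batch completes; returns percent finished
    advance(count: number): number {
      done = Math.min(total, done + count);
      return total === 0 ? 100 : Math.round((done / total) * 100);
    },
  };
}
```

Wired into `processBatch`, the `onBatchComplete` callback would call `advance(results.length)` and feed the percentage to a progress bar.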

Time-to-First-Token Optimization

If you are running your own AI infrastructure or proxying through an edge layer, optimizing time-to-first-token (TTFT) is the highest-leverage performance investment you can make.

TTFT is the time between the user's request and the first token of the response appearing on screen. For streaming UIs, this is the only latency that matters — everything after the first token is reading time, not waiting time.

A few practical techniques:

  • Edge proxy your LLM calls. If your user is in Tokyo and your server is in Virginia, that is 150ms of network latency before the request even reaches the LLM. Running your API proxy on Cloudflare Workers puts it within 50ms of the user.
  • Minimize prompt size. Every token in your system prompt adds to the time the model spends processing before it starts generating. Trim your system prompts ruthlessly.
  • Use smaller models for simple tasks. A model routing strategy that sends classification tasks to a small model (sub-1s TTFT) and complex generation to a larger model can cut average TTFT significantly.
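The routing idea in the last bullet can be as simple as a switch over task types. A sketch of the shape, with placeholder model names — the actual models and the task taxonomy are assumptions you would adapt to your stack:

```typescript
// Route simple, structured tasks to a small low-TTFT model and reserve
// the large model for open-ended generation. Model names are placeholders.
type Task = 'classify' | 'extract' | 'summarize' | 'generate';

function routeModel(task: Task): string {
  switch (task) {
    case 'classify':
    case 'extract':
      return 'small-fast-model'; // sub-second TTFT for structured tasks
    default:
      return 'large-capable-model'; // slower, but needed for quality
  }
}
```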

Measuring What Matters

You cannot improve what you do not measure. For AI features, track these metrics:

  1. Time-to-first-token (TTFT): How long before the user sees anything.
  2. Total generation time: How long the entire response takes.
  3. Perceived completion rate: What percentage of users wait for the full result vs. abandoning.
  4. Error-after-wait rate: How often users wait 5+ seconds only to see an error. This is the worst possible UX.

Log these metrics per feature, per model, and per user cohort. You will find that your heaviest users are the most sensitive to latency, because they have experienced your fast paths and now expect them everywhere.
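Capturing TTFT and total generation time fits naturally around the streaming callback. A minimal instrumentation sketch — the metric names and `Date.now()` timing are my choices, not a fixed schema:

```typescript
// Wraps a token callback and records time-to-first-token and total
// elapsed time for a single streamed request.
function instrument(onToken: (token: string) => void) {
  const start = Date.now();
  let firstTokenAt: number | null = null;

  return {
    onToken(token: string): void {
      if (firstTokenAt === null) firstTokenAt = Date.now();
      onToken(token);
    },
    metrics() {
      return {
        // null if no token ever arrived (an error-after-wait case)
        ttftMs: firstTokenAt === null ? null : firstTokenAt - start,
        totalMs: Date.now() - start,
      };
    },
  };
}
```

In use, you would pass `instrumented.onToken` into `streamResponse` and log `instrumented.metrics()` when the stream closes, tagged with feature, model, and user cohort.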

The 500ms Rule

Here is my personal rule of thumb: if the user sees meaningful output within 500 milliseconds of their action, they will not perceive any latency. Between 500ms and 2 seconds, they notice but tolerate it. Beyond 2 seconds, you need a progress indicator or streaming. Beyond 5 seconds without feedback, you are losing users.

Design every AI interaction around this timeline. If the model takes 3 seconds, stream it. If it takes 10 seconds, stream it AND show a progress indicator. If it takes 30 seconds, rethink the interaction model entirely — maybe it should be an async job with a notification.
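The thresholds above can be codified as a small decision helper. This is a sketch of my rule of thumb, not a standard — the 15-second cutoff between "stream with progress" and "make it an async job" is an assumption on my part:

```typescript
// Maps expected latency to the UX pattern suggested by the 500ms rule.
type UXPattern =
  | 'none'                 // feels instant, no indicator needed
  | 'subtle-indicator'     // noticed but tolerated
  | 'stream'               // stream tokens as they arrive
  | 'stream-with-progress' // stream plus an explicit progress indicator
  | 'async-job';           // background job with a notification

function patternFor(expectedMs: number): UXPattern {
  if (expectedMs <= 500) return 'none';
  if (expectedMs <= 2000) return 'subtle-indicator';
  if (expectedMs <= 5000) return 'stream';
  if (expectedMs <= 15000) return 'stream-with-progress'; // assumed cutoff
  return 'async-job';
}
```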

The latency tax is real, and it kills AI features that would otherwise be excellent. The fix is not faster models. It is better UX engineering. Stream your responses, show skeleton states, prefetch when you can, and always, always give the user something to look at while they wait.
