It was 2 AM on a Tuesday when the Anthropic API went down. Not a degradation — a full outage. Our compliance search engine, which handled about 200 queries per hour from enterprise users across three time zones, returned the same error for every request: "Service Unavailable." For 47 minutes, our AI feature was a blank page with a spinner.
We had spent months fine-tuning our RAG pipeline, perfecting our system prompts, and optimizing our context window. We had not spent a single hour planning for the scenario where the AI simply was not there.
That was a $12,000 lesson. Here is what I built afterward.
The Three Failure Modes
AI features fail in ways that traditional software does not. Understanding the failure modes is the first step to handling them.
Total outage. The API is down. You get 500 errors or timeouts. This is the easiest to detect and the most dramatic. It happened to us with Anthropic, and it happens to every provider eventually.
Degraded quality. The API responds, but the answers are worse than usual. Maybe the model is overloaded and producing shorter, less detailed responses. Maybe a provider-side update changed behavior. This is the hardest failure to detect because the system appears to work — it just works badly.
Slow responses. The API responds correctly, but takes 15–30 seconds instead of the usual 2–3. For streaming UIs, this means an agonizing wait before the first token. For synchronous operations, this means timeouts and retries that compound the problem.
Each failure mode needs a different response strategy. A circuit breaker handles outages. Quality monitoring handles degradation. Timeout policies handle slowness. You need all three.
The Circuit Breaker
A circuit breaker is a pattern borrowed from electrical engineering. When a service fails repeatedly, you "trip" the circuit — stop sending requests and immediately return a fallback response. After a cooldown period, you let a few requests through to test if the service has recovered.
```typescript
class CircuitBreaker {
  private failures = 0;
  private lastFailure = 0;
  private state: 'closed' | 'open' | 'half-open' = 'closed';

  constructor(
    private threshold = 5,
    private cooldownMs = 30000
  ) {}

  async execute<T>(fn: () => Promise<T>, fallback: () => T): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.lastFailure > this.cooldownMs) {
        this.state = 'half-open'; // cooldown elapsed — let a probe request through
      } else {
        return fallback();
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      return fallback();
    }
  }

  private onSuccess() {
    this.failures = 0;
    this.state = 'closed';
  }

  private onFailure() {
    this.failures++;
    this.lastFailure = Date.now();
    // A failed probe in half-open re-opens the circuit immediately;
    // otherwise, trip once the failure threshold is reached.
    if (this.state === 'half-open' || this.failures >= this.threshold) {
      this.state = 'open';
    }
  }
}
```
After that 2 AM outage, I wrapped every LLM call in a circuit breaker. When Anthropic goes down, we stop hammering their API after five failures and immediately serve fallback responses. This does two things: it gives users an instant response instead of a timeout, and it prevents our retry logic from making the provider's recovery harder.
Designing Fallback Responses
The fallback is the most underappreciated part of an AI feature. When I first implemented circuit breakers, my fallback was a generic error message: "AI is temporarily unavailable. Please try again later."
This is terrible. The user came to your product to get something done. Telling them to "try later" is telling them to go away.
Better fallbacks are contextual and useful:
```typescript
// The exact document shape is illustrative; searchFAQ and
// searchDocumentIndex query local, pre-computed indexes — no LLM involved.
interface FallbackResponse {
  response: string;
  source: 'faq' | 'document-index' | 'static';
  disclaimer?: string;
  documents?: Array<{ title: string; url: string }>;
  contactLink?: string;
}

function getComplianceFallback(query: string): FallbackResponse {
  // 1. Search the pre-computed FAQ database
  const faqMatch = searchFAQ(query);
  if (faqMatch && faqMatch.confidence > 0.8) {
    return {
      response: faqMatch.answer,
      source: 'faq',
      disclaimer: 'This answer is from our FAQ database. For detailed guidance, please try again shortly.',
    };
  }

  // 2. Return relevant documentation links
  const relevantDocs = searchDocumentIndex(query);
  if (relevantDocs.length > 0) {
    return {
      response: `I'm currently unable to provide a detailed answer, but these documents may help:`,
      documents: relevantDocs.slice(0, 3),
      source: 'document-index',
    };
  }

  // 3. Last resort: acknowledge and offer alternatives
  return {
    response: 'Our AI assistant is temporarily unavailable. You can browse our regulation database directly or contact our support team.',
    source: 'static',
    contactLink: '/support',
  };
}
```
The hierarchy matters: try to answer from a local cache or FAQ first, then offer relevant resources, and only as a last resort show a generic message. At Complai, our pre-computed FAQ covers about 60% of user queries, so most users during an outage still get a useful answer.
Provider Failover
If your AI feature is critical enough that any downtime is unacceptable, implement provider failover. This is the multi-cloud strategy applied to LLMs.
```typescript
const providers = [
  { name: 'anthropic', adapter: new AnthropicAdapter(), priority: 1 },
  { name: 'openai', adapter: new OpenAIAdapter(), priority: 2 },
];

async function callWithFailover(prompt: string): Promise<string> {
  // Copy before sorting so the shared provider list is not mutated
  const sorted = [...providers].sort((a, b) => a.priority - b.priority);

  for (const provider of sorted) {
    try {
      const response = await Promise.race([
        provider.adapter.sendMessage(prompt),
        timeout(10000), // 10-second timeout
      ]);
      return response;
    } catch (error) {
      console.warn(`Provider ${provider.name} failed, trying next`);
      continue;
    }
  }

  // All providers failed — use the fallback
  return getFallbackResponse(prompt);
}

function timeout(ms: number): Promise<never> {
  return new Promise((_, reject) =>
    setTimeout(() => reject(new Error('Timeout')), ms)
  );
}
```
The adapter pattern I wrote about previously makes this possible. Because all providers implement the same interface, failover is just a loop through the provider list.
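For reference, the shared surface can be as small as one method. A minimal sketch, with a stub adapter standing in for a real SDK wrapper (`ProviderAdapter` and `StubAdapter` are illustrative names, not from an actual SDK):

```typescript
// The failover loop only needs every provider to share one method.
interface ProviderAdapter {
  sendMessage(prompt: string): Promise<string>;
}

// Stub for illustration; a real adapter wraps the provider's SDK and
// translates its request/response shapes to this interface.
class StubAdapter implements ProviderAdapter {
  constructor(private reply: string | null) {}

  async sendMessage(prompt: string): Promise<string> {
    if (this.reply === null) throw new Error('provider down');
    return `${this.reply}: ${prompt}`;
  }
}
```

Because the loop in `callWithFailover` depends only on this interface, adding a third provider is one new adapter class and one entry in the provider list.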
The trade-off is cost and consistency. Your secondary provider might give different-quality responses than your primary. For regulated applications, you need to validate that the backup provider's output meets your quality standards before enabling failover in production.
Quality Monitoring
Detecting degraded quality is harder than detecting outages. The API returns 200 OK, but the answers are getting worse.
I track three proxy metrics for quality:
```typescript
interface QualityMetrics {
  averageResponseLength: number;
  citationRate: number; // Percentage of responses that include source citations
  refusalRate: number;  // Percentage of "I don't know" responses
}

async function trackQuality(query: string, response: string): Promise<void> {
  const hasCitation = /\[Source|§|regulation|section/i.test(response);
  const isRefusal = /I (?:don't|cannot|can't) (?:find|answer|help)/i.test(response);

  await db.query(
    `INSERT INTO quality_metrics (query, response_length, has_citation, is_refusal, created_at)
     VALUES ($1, $2, $3, $4, now())`,
    [query, response.length, hasCitation, isRefusal]
  );
}
```
When the refusal rate spikes above 30% or the average response length drops below 50% of the trailing average, an alert fires. These are early indicators that something has changed on the provider's side — a model update, capacity issues, or a change in behavior that affects our specific use case.
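The alert check itself is a small aggregation over recent rows. A sketch under the thresholds above, assuming the rows have already been loaded from the `quality_metrics` table (`checkQuality` and `MetricRow` are hypothetical names):

```typescript
interface MetricRow {
  responseLength: number;
  hasCitation: boolean;
  isRefusal: boolean;
}

// Compares a recent window against a trailing baseline and returns
// the list of triggered alert reasons (empty when healthy).
function checkQuality(recent: MetricRow[], baseline: MetricRow[]): string[] {
  const avgLen = (rows: MetricRow[]) =>
    rows.reduce((sum, r) => sum + r.responseLength, 0) / rows.length;
  const refusalRate = recent.filter((r) => r.isRefusal).length / recent.length;

  const alerts: string[] = [];
  if (refusalRate > 0.3) {
    alerts.push('refusal rate above 30%');
  }
  if (avgLen(recent) < 0.5 * avgLen(baseline)) {
    alerts.push('average response length below 50% of trailing average');
  }
  return alerts;
}
```

In production this would run on a schedule and feed whatever alerting channel you already use; the comparison logic is the whole trick.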
Timeout Strategy
The default timeout for most HTTP clients is 30 seconds. For AI features, this is almost always too long. A user who waits 30 seconds for a response has already left.
My timeout strategy is tiered:
- Streaming responses: 5-second timeout for the first token, then no timeout for subsequent tokens. If the first token does not arrive in 5 seconds, fail fast.
- Synchronous calls: 10-second hard timeout. If the model has not responded in 10 seconds, use the fallback.
- Background processing: 60-second timeout. Batch operations can afford to wait longer because the user is not staring at a spinner.
```typescript
async function callWithTimeout(
  fn: () => Promise<string>,
  timeoutMs: number,
  fallback: string
): Promise<{ result: string; timedOut: boolean }> {
  try {
    const result = await Promise.race([
      fn(),
      new Promise<never>((_, reject) =>
        setTimeout(() => reject(new Error('LLM timeout')), timeoutMs)
      ),
    ]);
    return { result, timedOut: false };
  } catch {
    return { result: fallback, timedOut: true };
  }
}
```
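The streaming case needs a slightly different helper: race only the first token, then hand the rest of the stream through with no deadline. A sketch, assuming the provider exposes tokens as an async iterator (`firstTokenTimeout` is a hypothetical name, not a library API):

```typescript
// Waits up to timeoutMs for the first token only; once it arrives,
// yields it and the remaining tokens with no further deadline.
async function* firstTokenTimeout(
  tokens: AsyncIterator<string>,
  timeoutMs: number
): AsyncGenerator<string> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('First token timeout')), timeoutMs);
  });

  // Only the first token races the deadline.
  const first = await Promise.race([tokens.next(), deadline]);
  clearTimeout(timer); // first token arrived — cancel the deadline
  if (first.done) return;
  yield first.value;

  // Subsequent tokens stream without a timeout.
  let next = await tokens.next();
  while (!next.done) {
    yield next.value;
    next = await tokens.next();
  }
}
```

The caller catches the timeout error and falls back, exactly as in the synchronous path, but a stream that has started is allowed to finish at its own pace.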
The Resilience Checklist
Before shipping any AI feature, I run through this checklist:
- What happens when the API is down? If the answer is "the feature breaks," you need a fallback.
- What happens when the API is slow? If the answer is "the user waits," you need a timeout strategy.
- What happens when the API returns garbage? If the answer is "we show it to the user," you need output validation.
- How will you know when quality degrades? If the answer is "users will complain," you need quality monitoring.
- Can the feature work without AI at all? If the answer is "partially," build that partial experience as your fallback.
Every "no" on this checklist is a production incident waiting to happen. I learned this the hard way at 2 AM. You do not have to.
The best AI features are the ones where users never realize the AI failed. They got a slightly different experience — maybe a cached answer, maybe a FAQ result, maybe a documentation link — but they still got value. That is the goal.
The AI is a dependency, not the product. Build your product to survive without its most impressive dependency, and it will be a better product even when the AI is working perfectly.