LLM Guardrails in Production: How to Keep AI on Script

A user asked our compliance AI to write a poem about marijuana regulations. It did, beautifully. That was the wrong answer. Here are the five layers of guardrails I built afterward.


The first week Complai was in production, a user asked the compliance assistant to write a poem about marijuana regulations. It did. Beautifully. In iambic pentameter. The problem was that Complai is a regulatory compliance tool, not a creative writing assistant. The AI was doing exactly what it was told — answering any question helpfully — but it was not doing what the product needed it to do.

That was when I learned that building an AI product is not just about making the model smart. It is about making it disciplined.

Why Guardrails Are a Product Decision

Guardrails are not about censorship. They are about product scope. When you build a search engine, you do not let users execute SQL queries against your database. When you build a calculator, you do not let users write emails. The same principle applies to AI features. Your AI should do what your product promises, and nothing else.

The challenge is that LLMs are generalists by nature. Without explicit boundaries, they will happily answer questions about cooking, debate philosophy, or roleplay as a pirate. That might be fine for a general chatbot. It is not fine for a compliance tool, a medical assistant, or an enterprise data pipeline.

Layer 1: System Prompt Boundaries

The first line of defense is your system prompt. This is where you define what the AI should and should not do. But most developers write system prompts that are too permissive.

// Too permissive — the model will answer anything
const weakPrompt = `You are a helpful assistant for our compliance platform.`;

// Better — explicit boundaries and refusal instructions
const strongPrompt = `You are a regulatory compliance assistant for the cannabis industry.

Your ONLY function is to answer questions about cannabis regulations, compliance requirements, and licensing procedures.

Rules:
1. Only answer questions directly related to cannabis regulation and compliance.
2. If a question is outside this scope, respond: "I can only help with cannabis regulatory compliance questions. Could you rephrase your question in that context?"
3. Never generate creative content, code, or general knowledge responses.
4. Always cite the specific regulation or source document when providing compliance guidance.
5. If you are unsure about a regulation, say so explicitly rather than guessing.`;

The key is specificity. "Be helpful" is not a guardrail. "Only answer questions about X, and refuse Y" is a guardrail. I structure every production system prompt with three sections: identity (who the AI is), scope (what it can do), and refusal instructions (what it must decline).
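That three-section structure can be enforced mechanically. Here is a minimal sketch of a prompt builder; the `PromptSections` shape and function name are my own convention for this example, not a library API:

```typescript
// Illustrative helper: assembles a system prompt from the three sections.
interface PromptSections {
  identity: string;   // who the AI is
  scope: string;      // what it can do
  refusals: string[]; // what it must decline, as explicit numbered rules
}

function buildSystemPrompt({ identity, scope, refusals }: PromptSections): string {
  const rules = refusals.map((r, i) => `${i + 1}. ${r}`).join('\n');
  return `${identity}\n\n${scope}\n\nRules:\n${rules}`;
}

const prompt = buildSystemPrompt({
  identity: 'You are a regulatory compliance assistant for the cannabis industry.',
  scope: 'Your ONLY function is to answer questions about cannabis regulations, compliance requirements, and licensing procedures.',
  refusals: [
    'Only answer questions directly related to cannabis regulation and compliance.',
    'Never generate creative content, code, or general knowledge responses.',
  ],
});
```

Keeping the sections as data rather than one long string makes it easier to review and version each boundary independently.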

Layer 2: Input Validation

System prompts can be bypassed. Users are creative, and some are adversarial. The second layer of defense validates user input before it reaches the model.

interface InputValidation {
  isValid: boolean;
  reason?: string;
  sanitizedInput?: string;
}

function validateInput(input: string): InputValidation {
  // Length check
  if (input.length > 4000) {
    return { isValid: false, reason: 'Message too long' };
  }

  // Check for prompt injection patterns
  const injectionPatterns = [
    /ignore (?:all |your |previous )?instructions/i,
    /you are now/i,
    /pretend (?:to be|you are)/i,
    /disregard (?:all |your |the )?(?:above|previous)/i,
    /system\s*prompt/i,
  ];

  for (const pattern of injectionPatterns) {
    if (pattern.test(input)) {
      return { isValid: false, reason: 'Input contains restricted patterns' };
    }
  }

  return { isValid: true, sanitizedInput: input.trim() };
}

This is not foolproof. Determined attackers can find ways around pattern matching. But it catches the casual attempts and accidental prompt injections that make up 95% of real-world issues.

Layer 3: Output Validation

Even with input validation and strong system prompts, the model can still go off-script. Output validation is your safety net.

For structured output, Zod validation handles this automatically — if the model returns data that does not match your schema, you reject it. But for free-text responses, you need a different approach.

interface OutputCheck {
  passed: boolean;
  violations: string[];
}

function validateOutput(response: string): OutputCheck {
  const violations: string[] = [];

  // Check for content that should never appear
  const blockedPatterns = [
    { pattern: /```(?:python|javascript|typescript|sql)/i, rule: 'No code generation' },
    { pattern: /(?:once upon a time|dear diary)/i, rule: 'No creative writing' },
    { pattern: /(?:I think|in my opinion|I believe)/i, rule: 'No personal opinions' },
  ];

  for (const { pattern, rule } of blockedPatterns) {
    if (pattern.test(response)) {
      violations.push(rule);
    }
  }

  // Check response length
  if (response.length > 10000) {
    violations.push('Response exceeds maximum length');
  }

  return { passed: violations.length === 0, violations };
}

When output validation fails, you have two options: retry with a corrective prompt, or return a safe fallback response. In production, I retry once with a prompt that includes the violation reason, and if it fails again, I return a fallback.
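That retry-then-fallback policy can be sketched as follows. The `generate` parameter stands in for whatever LLM call your app makes, and `checkOutput` is a minimal stand-in for the `validateOutput` function above:

```typescript
interface OutputCheck { passed: boolean; violations: string[]; }

// Minimal stand-in for the fuller validateOutput shown earlier.
function checkOutput(response: string): OutputCheck {
  const violations: string[] = [];
  if (/once upon a time/i.test(response)) violations.push('No creative writing');
  return { passed: violations.length === 0, violations };
}

const FALLBACK = 'I could not produce a compliant answer. Please rephrase your question.';

async function answerWithRetry(
  generate: (prompt: string) => Promise<string>,
  userPrompt: string
): Promise<string> {
  const first = await generate(userPrompt);
  const check = checkOutput(first);
  if (check.passed) return first;

  // One retry that tells the model exactly which rules it broke.
  const corrective =
    `${userPrompt}\n\nYour previous answer violated: ` +
    `${check.violations.join('; ')}. Answer again without violating these rules.`;
  const second = await generate(corrective);
  return checkOutput(second).passed ? second : FALLBACK;
}
```

Capping at one retry keeps worst-case latency bounded: a failing request costs at most two model calls before the user sees the fallback.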

Layer 4: Topic Classification

For more sophisticated guardrails, use a lightweight classification model to determine whether the user's question is on-topic before sending it to the main model.

async function isOnTopic(
  input: string,
  allowedTopics: string[]
): Promise<{ onTopic: boolean; detectedTopic: string }> {
  const result = await llm.generateStructured({
    model: 'gpt-4o-mini', // Use a cheap, fast model for classification
    schema: z.object({
      topic: z.string(),
      isRelevant: z.boolean(),
      confidence: z.number().min(0).max(1),
    }),
    prompt: `Classify whether this question is related to any of these topics: ${allowedTopics.join(', ')}.
    
Question: "${input}"

Respond with the detected topic, whether it's relevant, and your confidence level.`,
  });

  return {
    onTopic: result.isRelevant && result.confidence > 0.7,
    detectedTopic: result.topic,
  };
}

This adds latency and cost, but for regulated industries — healthcare, finance, legal, compliance — it is non-negotiable. The cost of a single inappropriate response far exceeds the cost of an extra classification call.

Layer 5: Rate Limiting and Abuse Prevention

Guardrails also need to protect against abuse at the infrastructure level. A user who sends 100 requests per minute is either a bot or someone stress-testing your system; either way, you need to throttle them.

const rateLimiter = new Map<string, { count: number; resetAt: number }>();

function checkRateLimit(userId: string, maxRequests = 20, windowMs = 60000): boolean {
  const now = Date.now();
  const userLimit = rateLimiter.get(userId);

  if (!userLimit || now > userLimit.resetAt) {
    rateLimiter.set(userId, { count: 1, resetAt: now + windowMs });
    return true;
  }

  if (userLimit.count >= maxRequests) {
    return false;
  }

  userLimit.count++;
  return true;
}

Beyond rate limiting, track usage patterns. A user who suddenly shifts from asking compliance questions to probing the model's system prompt is exhibiting suspicious behavior. Log it, flag it, and consider escalating to human review.
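A minimal version of that tracking looks like the sketch below. The threshold and the in-memory Map are placeholders; a real deployment would persist counts and wire the flag into a review queue:

```typescript
// Count how often a user trips the guardrails, and flag the account
// for human review once the count passes a threshold.
const guardrailHits = new Map<string, number>();
const FLAG_THRESHOLD = 3;

function recordGuardrailHit(userId: string, wasBlocked: boolean): 'ok' | 'flagged' {
  if (!wasBlocked) return 'ok';
  const count = (guardrailHits.get(userId) ?? 0) + 1;
  guardrailHits.set(userId, count);
  // A few blocked requests is a signal, not proof: flag, do not ban.
  return count >= FLAG_THRESHOLD ? 'flagged' : 'ok';
}
```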

The Guardrail Stack in Practice

In production, these layers work together as a pipeline:

  1. Rate limit check — Is the user within their quota?
  2. Input validation — Is the input safe and well-formed?
  3. Topic classification — Is the question on-topic?
  4. LLM call — Generate the response with a strong system prompt.
  5. Output validation — Does the response meet quality standards?
  6. Logging — Record everything for audit and improvement.

Each layer catches different failure modes. Rate limiting catches abuse. Input validation catches injection attacks. Topic classification catches off-topic requests. The system prompt keeps the model focused. Output validation catches the model's mistakes. And logging gives you the data to improve all of the above.
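The fail-fast shape of this pipeline can be expressed as a chain of guards, where each guard returns a refusal message or null to pass. The two sample guards here are simplified versions of the checks described above:

```typescript
type Guard = (ctx: { userId: string; input: string }) => string | null;

// Run guards in order and stop at the first one that trips.
function runGuards(guards: Guard[], ctx: { userId: string; input: string }): string | null {
  for (const guard of guards) {
    const refusal = guard(ctx);
    if (refusal !== null) return refusal; // fail fast at the tripped layer
  }
  return null; // all layers passed; safe to call the model
}

const pipeline: Guard[] = [
  ({ input }) => (input.length > 4000 ? 'Message too long.' : null),
  ({ input }) =>
    /ignore (?:all |your |previous )?instructions/i.test(input)
      ? 'Input contains restricted patterns.'
      : null,
];
```

New layers become one-line additions to the array, which keeps the ordering of the stack explicit and easy to audit.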

The goal of guardrails is not to make the AI perfect. It is to make the failures predictable and manageable. A user getting a polite refusal is a successful guardrail. A user getting a poem about marijuana regulations is not.

Iterating on Guardrails

Guardrails are not set-and-forget. They require ongoing maintenance based on real user behavior.

Every week, I review the logs from Complai's guardrail pipeline. I look for three things:

  1. False positives: Legitimate questions that got blocked. These indicate overly aggressive guardrails that need loosening.
  2. False negatives: Off-topic or inappropriate responses that slipped through. These indicate gaps that need closing.
  3. New attack patterns: Users finding creative ways around existing guardrails. These require new rules.

The first month in production, I adjusted the guardrails 14 times. By month three, adjustments dropped to once or twice a month. The system learned from its failures — not automatically, but through my manual review and iteration.

Build your guardrails, ship them, and then watch them closely. The real learning happens in production, not in development.

