
AI Code Review: How LLMs Catch Bugs Your Linter Misses

My last production bug passed TypeScript strict mode and every lint rule. An LLM found it in 5 seconds. Here's how to add AI code review to your CI pipeline for $2/month.


Last month, I pushed a commit that passed TypeScript strict mode, cleared every lint rule, and broke production. The bug was a race condition in our data pipeline — two concurrent requests could write to the same database row, and the second write would silently overwrite the first. No linter catches race conditions. No type system models temporal behavior. But when I described the code to an LLM and asked "What could go wrong?", it identified the race condition in under five seconds.
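Stripped to its essentials, the bug looked like this (a simplified sketch with an in-memory map standing in for the database, not the actual pipeline code):

```typescript
// Two concurrent writers read the same row, then both write back; the
// second write silently clobbers the first — the classic lost update.
type Row = { id: string; tags: string[] };

const db = new Map<string, Row>([['a', { id: 'a', tags: [] }]]);

async function addTag(id: string, tag: string): Promise<void> {
  const row = db.get(id)!;                           // 1. read current state
  await new Promise((r) => setTimeout(r, 10));       // simulate I/O latency
  db.set(id, { ...row, tags: [...row.tags, tag] });  // 2. write back stale state
}

// Both calls read tags: [] before either writes, so one tag is lost.
async function demo(): Promise<number> {
  await Promise.all([addTag('a', 'x'), addTag('a', 'y')]);
  return db.get('a')!.tags.length; // 1, not 2
}
```

Every line of this type checks. The bug only exists in the gap between the read and the write.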

That experience changed how I think about code review. Static analysis catches syntax and type errors. Human reviewers catch architectural issues and business logic mistakes. LLMs catch a third category: subtle semantic bugs that are invisible to tooling and easy for humans to miss.

What LLMs Are Good At

LLMs excel at code review tasks that require understanding intent rather than enforcing rules. A linter can tell you that a variable is unused. An LLM can tell you that your error handling logic has a gap where a specific exception type will crash the process.

The categories where I have seen the most value:

Concurrency bugs. Race conditions, deadlocks, and ordering issues are notoriously hard to spot in code review. LLMs are surprisingly good at reasoning about temporal sequences and identifying windows where concurrent operations can conflict.

Edge cases in business logic. "What happens when the input is an empty array?" "What if the user has no permissions?" "What if the API returns a 200 with an empty body?" These are the questions that experienced reviewers ask, and LLMs ask them too — often more consistently than tired humans on their fifth PR of the day.

Security vulnerabilities. SQL injection, path traversal, insecure deserialization — LLMs have seen thousands of examples in their training data and can spot these patterns in new code.

API misuse. Using a library function incorrectly, passing wrong parameter types to a loosely typed API, or missing required cleanup steps (closing connections, releasing locks) are all patterns that LLMs catch well.
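As a concrete illustration of that last category, here is the kind of missing-cleanup bug that satisfies the type checker completely (a contrived sketch with a toy `Pool` class, not a real driver):

```typescript
// The early return on the error path skips release(), leaking a connection.
// No type error, no lint warning — but the pool slowly drains under load.
class Pool {
  private inUse = 0;
  acquire() { this.inUse++; return { query: (_sql: string) => 'rows' }; }
  release() { this.inUse--; }
  leaked() { return this.inUse; }
}

function fetchUser(pool: Pool, id: string): string | null {
  const conn = pool.acquire();
  if (!id) return null; // BUG: returns without pool.release()
  const rows = conn.query('select * from users where id = $1');
  pool.release();
  return rows;
}
```

An LLM reviewer flags the early return immediately; a linter has no rule for "this resource needs releasing on every path" unless you write one.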

What LLMs Are Bad At

LLMs are not a replacement for human code review. They have significant blind spots:

Architecture decisions. An LLM cannot tell you whether a new abstraction layer is warranted or whether a function belongs in this module or that one. These decisions require understanding the project's history, team preferences, and long-term direction.

Performance in context. An LLM might flag a nested loop as O(n²), but it cannot tell you whether that matters for your dataset size. Performance optimization requires domain knowledge that the model does not have.

Business requirements. The model does not know that your application requires HIPAA compliance, or that this particular field must never be null because downstream systems crash. It reviews the code as code, not as a business artifact.

False positives. LLMs produce false positives. Not constantly, but often enough that you cannot trust them blindly. Every finding needs human verification.

A Practical CI Pipeline

Here is how I integrated LLM code review into our CI pipeline at Lit Alerts. The goal was to add a review step that catches bugs without slowing down the development cycle.

import { z } from 'zod';

const ReviewFinding = z.object({
  severity: z.enum(['critical', 'warning', 'suggestion']),
  file: z.string(),
  line: z.number().optional(),
  description: z.string(),
  suggestion: z.string().optional(),
});

const ReviewResult = z.object({
  findings: z.array(ReviewFinding),
  summary: z.string(),
});

// `llm` here is our thin wrapper around the OpenAI SDK; generateStructured
// sends the request with a JSON schema derived from the Zod schema and
// validates the response before returning it.
async function reviewDiff(diff: string): Promise<z.infer<typeof ReviewResult>> {
  const result = await llm.generateStructured({
    model: 'gpt-4o',
    schema: ReviewResult,
    messages: [
      {
        role: 'system',
        content: `You are a senior code reviewer. Analyze the git diff and identify:
1. Bugs: race conditions, null pointer risks, resource leaks, error handling gaps
2. Security issues: injection vulnerabilities, authentication gaps, data exposure
3. Logic errors: off-by-one, incorrect boundary conditions, missing edge cases

Do NOT flag:
- Style issues (formatting, naming conventions)
- Minor optimizations that don't affect correctness
- Anything a linter or type checker would catch

Be specific. Include the file name and line number when possible.
Only report issues you are confident about. False positives waste developer time.`,
      },
      {
        role: 'user',
        content: `Review this diff:\n\n${diff}`,
      },
    ],
  });

  return result;
}

The system prompt is critical. Without explicit instructions to avoid style issues and linter-catchable problems, the model wastes its analysis on formatting complaints. I want it focused on the semantic bugs that nothing else catches.

Integrating with GitHub Actions

The review runs as a GitHub Action on every pull request. It posts findings as PR comments, categorized by severity.

name: AI Code Review
on:
  pull_request:
    types: [opened, synchronize]

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Get diff
        run: git diff origin/main...HEAD > diff.txt

      - name: Run AI review
        id: review  # the script writes `has_findings` and `findings` as step outputs
        run: node scripts/ai-review.mjs
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

      - name: Post comments
        if: steps.review.outputs.has_findings == 'true'
        uses: actions/github-script@v7
        env:
          FINDINGS: ${{ steps.review.outputs.findings }}
        with:
          script: |
            const findings = JSON.parse(process.env.FINDINGS);
            for (const finding of findings) {
              await github.rest.pulls.createReviewComment({
                owner: context.repo.owner,
                repo: context.repo.repo,
                pull_number: context.payload.pull_request.number,
                commit_id: context.payload.pull_request.head.sha,
                body: `**${finding.severity}**: ${finding.description}\n\n${finding.suggestion || ''}`,
                path: finding.file,
                line: finding.line || 1,
              });
            }

A few important design decisions:

Non-blocking. The AI review never blocks merging. It posts comments, but it does not fail the build. This prevents false positives from creating friction.

Diff-only. The model reviews only the changed code, not the entire file. This keeps the context window manageable and the cost low.

Severity filtering. In practice, I only surface "critical" findings as PR comments. "Warning" and "suggestion" findings get logged to a dashboard for periodic review.
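The severity filter itself is a one-liner against the finding shape from the Zod schema above (sketch):

```typescript
type Severity = 'critical' | 'warning' | 'suggestion';
interface Finding { severity: Severity; file: string; description: string }

// Critical findings become PR comments; everything else goes to the dashboard.
function partitionFindings(findings: Finding[]): { comments: Finding[]; logged: Finding[] } {
  return {
    comments: findings.filter((f) => f.severity === 'critical'),
    logged: findings.filter((f) => f.severity !== 'critical'),
  };
}
```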

Cost and Performance

At Lit Alerts, we average about 15 pull requests per week. Each review costs approximately:

  • Average diff size: ~300 lines (~2,000 tokens)
  • System prompt: ~500 tokens
  • Output: ~500 tokens
  • Cost per review: ~$0.03 with GPT-4o

That is about $2 per month for automated code review. Even if the model catches one bug per month that would have reached production, the ROI is extraordinary.
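The arithmetic behind that number, for anyone who wants to plug in their own PR volume:

```typescript
// Monthly cost = PRs per week × weeks per month × cost per review.
const reviewsPerWeek = 15;
const weeksPerMonth = 52 / 12;  // ≈ 4.33
const costPerReview = 0.03;     // dollars, GPT-4o, ~3,000 tokens per review
const monthlyCost = reviewsPerWeek * weeksPerMonth * costPerReview; // ≈ $1.95
```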

The latency is more significant. A review takes 5–15 seconds, which adds to the PR feedback loop. But since it runs in parallel with other CI checks (tests, linting, type checking), it rarely adds to the total pipeline duration.

What We Have Caught

In three months of running AI code review, here are the most notable catches:

  1. A missing await on a database transaction commit. The code appeared to work in testing because the event loop usually flushed the commit before the response was sent. In production under load, it would intermittently lose writes.

  2. A regex denial-of-service vulnerability. A user-facing search field used a regex pattern that could be exploited with carefully crafted input to consume 100% CPU for several seconds.

  3. An authentication bypass in an API route. A new endpoint was added without the authentication middleware that every other endpoint used. The developer simply forgot to include it.

  4. A memory leak in a WebSocket handler. Event listeners were being added on each connection but never removed on disconnect. Over time, this would exhaust memory.

None of these were caught by our linter, our type checker, or our test suite. Three of the four would have required specific knowledge or experience to spot in human review.
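Catch #1 is worth illustrating, since it is such an easy bug to write. A stripped-down version (with a hypothetical `Tx` interface, not our actual database client):

```typescript
interface Tx { commit(): Promise<void> }

// BUG: commit() returns a Promise that is never awaited, so the handler
// responds to the client before the transaction durably commits. Under
// load, a delayed or failed commit silently loses the write.
async function saveOrder(tx: Tx, respond: (status: number) => void): Promise<void> {
  tx.commit();   // missing `await`
  respond(200);  // client already sees success
}
```

For what it is worth, `@typescript-eslint/no-floating-promises` can flag this particular pattern, but only if the rule is enabled; the LLM caught it with no configuration.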

Practical Advice

If you want to add AI code review to your workflow:

  1. Start with a non-blocking comment bot. Do not gate merges on AI findings until you have tuned the system to minimize false positives.
  2. Focus the model on categories your existing tools miss. If your linter already catches everything stylistic, tell the model to skip style issues entirely.
  3. Use structured output. Parsing free-text review comments is fragile. Use Zod schemas to enforce a consistent finding format.
  4. Track false positive rate. If more than 20% of findings are false positives, developers will start ignoring all findings. Tune your prompt to be more conservative.
  5. Review the reviews. Spend 15 minutes each week reading the AI's findings. You will learn what it is good at, what it misses, and how to improve the prompt.
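For point 4, the tracking does not need to be fancy: label each finding's outcome during the weekly review pass and watch the ratio (a minimal sketch; the `Resolution` labels are names I am inventing here):

```typescript
// Each AI finding gets labeled during the weekly review pass.
type Resolution = 'fixed' | 'false_positive' | 'ignored';

function falsePositiveRate(resolutions: Resolution[]): number {
  if (resolutions.length === 0) return 0;
  const fp = resolutions.filter((r) => r === 'false_positive').length;
  return fp / resolutions.length;
}

// Above ~20%, developers start tuning the bot out — tighten the prompt.
function promptNeedsTightening(resolutions: Resolution[]): boolean {
  return falsePositiveRate(resolutions) > 0.2;
}
```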

AI code review is not a replacement for human reviewers. It is a second pair of eyes that never gets tired, never rushes through a review, and has seen more code patterns than any individual developer. Use it as a complement, not a substitute.

