[ AI INTEGRATION ] // CODE REVIEW

LLM code review that engineers actually read — not another bot they mute.

We embed Claude or GPT into your GitHub or GitLab pipeline as a tuned reviewer: scoped to the diff, split by concern, and gated by deterministic policy checks that block merges on real violations.

Veteran-Owned SDVOSB
[001 / 005] Field Conditions

Most teams turn on an AI reviewer, get buried in noise, and quietly mute it within a month.

// SITUATION

The default pattern is depressingly consistent. Someone installs an AI review tool on a Friday. By Monday, every PR has fifteen comments — half of them restating what the diff already shows, a few flagging style preferences from a different language, one or two genuinely useful. Developers start replying with eye-roll emojis, then ignoring the bot entirely. Security wanted it for catching secrets and injection; instead it complains about variable naming. Six weeks later the integration is disabled and the budget line gets cut. The tooling wasn't wrong — the configuration was.

  • Reviewer runs on the entire repo context, not the diff, so it surfaces issues the PR didn't introduce.
  • Single mega-prompt mixes security, correctness, and style — you can't tune or disable any of them independently.
  • No confidence threshold or feedback loop, so precision drifts and developers learn to ignore every comment.
  • Findings post as advisory comments only, so the secrets-in-code and missing-authz issues that should block merge don't.

2-4 comments per PR after tuning
< 3 wks from shadow mode to blocking checks
70%+ precision floor enforced per category
[002 / 005] Operational Approach

Treat the bot like a junior reviewer with a tight scope, not an oracle.

  1. STEP-01

    Scope the review surface

    Run the LLM only on the diff plus a bounded blast radius — touched files, their direct importers, and relevant test files. Skip vendored code, generated files, lockfiles, and migrations. This alone cuts token spend 60-80% and kills most off-topic comments.
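
    A minimal sketch of that filter is below; the skip patterns and the `resolveImporters` stand-in are illustrative, not a fixed config.

// scope.ts — sketch of the review-surface filter; patterns and helpers are illustrative
const SKIP_PATTERNS = [
  /^vendor\//,                                  // vendored code
  /\.generated\.|^dist\//,                      // generated output
  /(package-lock\.json|yarn\.lock|go\.sum)$/,   // lockfiles
  /^migrations\//,                              // schema migrations
];

export interface ChangedFile {
  path: string;
  patch: string; // unified diff hunks for this file
}

// Keep only files worth reviewing; everything else never reaches the model.
export function scopeReviewSurface(changed: ChangedFile[]): ChangedFile[] {
  return changed.filter(f => !SKIP_PATTERNS.some(p => p.test(f.path)));
}

// The bounded blast radius (direct importers, sibling tests) comes from a
// dependency-graph lookup; resolveImporters is a stand-in for that step.
export function blastRadius(
  changed: ChangedFile[],
  resolveImporters: (path: string) => string[]
): string[] {
  return [...new Set(changed.flatMap(f => resolveImporters(f.path)))];
}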

  2. STEP-02

    Split into specialized passes

    One prompt for security (injection, authz, secrets), one for correctness (null paths, race conditions, error handling), one for policy (logging PII, deprecated APIs, internal style rules). Separate prompts beat one mega-prompt because each can be tuned, evaluated, and disabled independently.
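
    A sketch of the split, assuming a generic `callModel` helper and per-pass prompt files; the file names are placeholders.

// passes.ts — sketch of the three-pass split; callModel and the prompt paths are assumptions
import { readFileSync } from "fs";

type Category = "security" | "correctness" | "policy";

const PASSES: { category: Category; promptFile: string; enabled: boolean }[] = [
  { category: "security",    promptFile: "prompts/security.md",    enabled: true },
  { category: "correctness", promptFile: "prompts/correctness.md", enabled: true },
  { category: "policy",      promptFile: "prompts/policy.md",      enabled: true },
];

// Each pass has its own prompt and its own eval set, so it can be tuned or muted alone.
export async function runPasses(
  diff: string,
  callModel: (systemPrompt: string, diff: string) => Promise<unknown>
) {
  return Promise.all(
    PASSES.filter(p => p.enabled).map(async p => ({
      category: p.category,
      raw: await callModel(readFileSync(p.promptFile, "utf8"), diff),
    }))
  );
}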

  3. STEP-03

    Tune signal-to-noise with a feedback loop

    Every developer reaction (👍, 👎, resolved, ignored) gets logged. We replay weekly against a held-out set of merged PRs to measure precision. If a category falls below 70% precision, it gets retuned or muted. No silent drift.
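
    A sketch of the weekly check; the reaction-log shape and the useful-versus-noise mapping are assumptions about how reactions get recorded, not a fixed schema.

// precision.ts — sketch of the per-category precision check; the log shape is an assumption
interface ReactionEvent {
  category: "security" | "correctness" | "policy";
  outcome: "thumbs_up" | "resolved" | "thumbs_down" | "ignored";
}

const PRECISION_FLOOR = 0.7;

// A finding counts as useful if a developer acted on it; anything downvoted or
// left untouched counts against the category. Categories under the floor get flagged.
export function categoriesBelowFloor(events: ReactionEvent[]): string[] {
  const tallies = new Map<string, { useful: number; total: number }>();
  for (const e of events) {
    const t = tallies.get(e.category) ?? { useful: 0, total: 0 };
    t.total += 1;
    if (e.outcome === "thumbs_up" || e.outcome === "resolved") t.useful += 1;
    tallies.set(e.category, t);
  }
  return [...tallies.entries()]
    .filter(([, t]) => t.total > 0 && t.useful / t.total < PRECISION_FLOOR)
    .map(([category]) => category);
}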

  4. STEP-04

    Enforce policy as blocking checks

    Style suggestions stay as PR comments. Hard rules — hardcoded secrets, missing authz on new endpoints, unencrypted PII fields, banned dependencies — become required GitHub/GitLab checks that block merge. The LLM produces structured JSON; a deterministic gate decides pass/fail.

  5. STEP-05

    Wire into the existing review flow

    Comments post as a bot user via the GitHub Reviews and Checks APIs or GitLab MR notes, threaded on the right lines, with a single summary comment at top. Developers can reply `/ignore` or `/explain` and the bot responds in-thread. No new dashboard to check.
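
A sketch of the posting half using Octokit's Reviews API; the comment formatting and the `/ignore` hint are illustrative, and the blocking-check half is the pattern that follows.

// post-review.ts — sketch of threading findings onto the diff; the finding shape matches
// the schema in the pattern below, the comment formatting is an assumption
import { Octokit } from "@octokit/rest";

interface LineFinding { file: string; line: number; rule: string; message: string }

export async function postReview(
  gh: Octokit,
  pr: { owner: string; repo: string; pull_number: number },
  summary: string,
  findings: LineFinding[]
) {
  await gh.pulls.createReview({
    ...pr,
    event: "COMMENT",   // the bot comments; it never approves or requests changes
    body: summary,      // single summary comment at the top of the review
    comments: findings.map(f => ({
      path: f.file,
      line: f.line,
      side: "RIGHT",
      body: `**${f.rule}**: ${f.message}\n\nReply \`/ignore ${f.rule} <reason>\` or \`/explain\`.`,
    })),
  });
}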

// TYPESCRIPT PATTERN
// github-action: ai-review.ts
import { Octokit } from "@octokit/rest";
import { z } from "zod";
import { reviewWithClaude } from "./llm";

const Finding = z.object({
  category: z.enum(["security", "correctness", "policy"]),
  severity: z.enum(["block", "warn", "nit"]),
  file: z.string(),
  line: z.number(),
  rule: z.string(),       // e.g. "hardcoded-secret"
  message: z.string(),
  confidence: z.number().min(0).max(1),
});

const BLOCKING_RULES = new Set([
  "hardcoded-secret",
  "missing-authz-check",
  "unencrypted-pii",
  "banned-dependency",
]);

// Render findings as a markdown list for the check summary
function render(findings: z.infer<typeof Finding>[]): string {
  return findings
    .map(f => `- [${f.severity}] ${f.file}:${f.line} (${f.rule}): ${f.message}`)
    .join("\n");
}

export async function run(prDiff: string, sha: string) {
  const raw = await reviewWithClaude(prDiff);
  const findings = z.array(Finding).parse(raw);

  // Drop low-confidence noise before humans see it
  const visible = findings.filter(f => f.confidence >= 0.75);

  const blocking = visible.filter(
    f => f.severity === "block" && BLOCKING_RULES.has(f.rule)
  );

  const gh = new Octokit({ auth: process.env.GH_TOKEN });
  await gh.checks.create({
    owner: process.env.OWNER!,
    repo: process.env.REPO!,
    name: "ai-review",
    head_sha: sha,
    status: "completed",
    conclusion: blocking.length ? "failure" : "success",
    output: { title: `${visible.length} findings`, summary: render(visible) },
  });
}

Structured-output review pass with a deterministic policy gate — the LLM finds candidates, but a rule decides whether to block the merge.

[003 / 005] Common Questions

Field FAQ.

How do you keep the AI reviewer from drowning developers in noise?

Three controls. First, confidence thresholds — findings below 0.75 confidence are dropped before posting. Second, per-category precision tracking against developer reactions; categories that fall below 70% precision get retuned or muted. Third, deduplication against prior comments on the same PR. Most teams we ship for end up with 2-4 comments per average PR, not 30. If developers start ignoring the bot, the integration is dead.
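
A sketch of the third control, deduplication; keying on path, line, and rule id is one workable approach, and it assumes each bot comment leads with its rule id.

// dedupe.ts — sketch of deduplication against comments already on the PR;
// embedding the rule id at the start of each bot comment is an assumption of this sketch
import { Octokit } from "@octokit/rest";

interface Finding { file: string; line: number; rule: string; message: string }

export async function dropAlreadyPosted(
  gh: Octokit,
  pr: { owner: string; repo: string; pull_number: number },
  findings: Finding[]
): Promise<Finding[]> {
  const existing = await gh.paginate(gh.pulls.listReviewComments, pr);
  // Bot comments start with "**<rule-id>**", so path + line + rule identifies a finding.
  const seen = new Set(
    existing.map(c => `${c.path}:${c.line ?? c.original_line}:${ruleIdOf(c.body)}`)
  );
  return findings.filter(f => !seen.has(`${f.file}:${f.line}:${f.rule}`));
}

function ruleIdOf(body: string): string {
  return body.match(/^\*\*(.+?)\*\*/)?.[1] ?? "";
}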

Does this replace human code review?

No, and any vendor claiming it does is selling you something. The LLM is good at pattern-matching known failure modes — missing error handling, obvious injection, secrets in code, deprecated APIs. It is bad at judging architecture, business logic, and whether a change should exist at all. We position it as a first-pass that frees senior reviewers to focus on the things only humans can evaluate.

How does it integrate with GitHub or GitLab?

On GitHub, it runs as a GitHub Action or App that posts via the Checks API and Reviews API — line comments thread on the diff, a summary comment goes at the top, and blocking findings register as a required check. On GitLab, it runs as a CI job using the Merge Request notes API with the same pattern. Self-hosted runners are supported, which matters for regulated and federal environments.
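
For the GitLab side, a minimal sketch of the posting step as it might run inside a CI job, using GitLab's predefined `CI_API_V4_URL`, `CI_PROJECT_ID`, and `CI_MERGE_REQUEST_IID` variables and a bot token; threaded discussions and error handling are trimmed.

// gitlab-note.ts — sketch of posting the summary as an MR note from a CI job;
// assumes a bot token in GITLAB_TOKEN and GitLab's predefined CI variables
export async function postMrNote(summary: string): Promise<void> {
  const url =
    `${process.env.CI_API_V4_URL}/projects/${process.env.CI_PROJECT_ID}` +
    `/merge_requests/${process.env.CI_MERGE_REQUEST_IID}/notes`;

  const res = await fetch(url, {
    method: "POST",
    headers: {
      "PRIVATE-TOKEN": process.env.GITLAB_TOKEN!,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ body: summary }),
  });
  if (!res.ok) throw new Error(`GitLab note failed: ${res.status}`);
}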

What about source code leaving our environment to call an LLM?

Three deployment options depending on your posture. Most commercial clients use Anthropic or OpenAI with zero-retention agreements and a redaction layer that strips secrets before the call. Regulated clients use Bedrock or Azure OpenAI inside their own VPC. Federal clients with CUI or higher run against an on-prem model — Llama 3.1 70B or Qwen2.5-Coder — on GovCloud GPUs. Diffs never leave the boundary.
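
A sketch of that redaction layer; the pattern list is deliberately short and illustrative, and a production config carries many more patterns.

// redact.ts — sketch of the pre-call redaction pass; the pattern list is illustrative, not exhaustive
const SECRET_PATTERNS: [RegExp, string][] = [
  [/AKIA[0-9A-Z]{16}/g, "[REDACTED_AWS_KEY]"],
  [/ghp_[A-Za-z0-9]{36}/g, "[REDACTED_GITHUB_TOKEN]"],
  [/-----BEGIN [A-Z ]*PRIVATE KEY-----[\s\S]*?-----END [A-Z ]*PRIVATE KEY-----/g, "[REDACTED_PRIVATE_KEY]"],
  [/\b(password|secret|api[_-]?key)\b(\s*[:=]\s*)["'][^"']+["']/gi, '$1$2"[REDACTED]"'],
];

// Strip anything that looks like a credential before the diff leaves the boundary.
// The reviewer can still flag the location of a hardcoded secret; it never sees the value.
export function redactDiff(diff: string): string {
  return SECRET_PATTERNS.reduce(
    (text, [pattern, replacement]) => text.replace(pattern, replacement),
    diff
  );
}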

How do you handle false positives on security findings?

Every finding includes the rule ID, the line, and the reasoning. Developers can reply `/ignore <rule-id> <reason>` in the PR thread, which suppresses that rule on that file path and logs the suppression for later audit. We review suppression patterns weekly — if one rule is suppressed by 40% of developers, the rule is wrong, not the developers. The prompt or the rule gets fixed.
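
A sketch of the suppression flow; the store shape and audit fields shown are examples, not a fixed schema.

// suppress.ts — sketch of /ignore handling; the suppression record shape is an assumption
interface Suppression {
  rule: string;
  path: string;
  reason: string;
  author: string;
  createdAt: string;
}

// Parse "/ignore <rule-id> <reason>" from a PR reply and record it for later audit.
export function parseIgnore(commentBody: string, path: string, author: string): Suppression | null {
  const match = commentBody.trim().match(/^\/ignore\s+(\S+)\s+(.+)$/);
  if (!match) return null;
  return { rule: match[1], path, reason: match[2], author, createdAt: new Date().toISOString() };
}

// Drop findings covered by an existing suppression for that rule on that file path.
export function applySuppressions<T extends { rule: string; file: string }>(
  findings: T[],
  suppressions: Suppression[]
): T[] {
  const keys = new Set(suppressions.map(s => `${s.rule}:${s.path}`));
  return findings.filter(f => !keys.has(`${f.rule}:${f.file}`));
}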

Can it enforce our internal coding standards and policies?

Yes, and this is usually where it pays for itself. We encode internal rules — logging conventions, banned dependencies, required headers on new endpoints, PII handling, deprecation lists — as a policy document the reviewer is grounded against. Adding a new rule is a pull request to that document, not a code change to the reviewer. Teams typically start with 10-15 rules and grow from there.
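
A sketch of what that policy document and its loader can look like, assuming a YAML file and the `yaml` package; the fields and rule ids are examples.

// policy.ts — sketch of the policy document the reviewer is grounded against;
// the YAML layout and rule fields here are examples, not a fixed schema
import { z } from "zod";
import { parse } from "yaml";
import { readFileSync } from "fs";

const PolicyRule = z.object({
  id: z.string(),           // e.g. "banned-dependency"
  severity: z.enum(["block", "warn", "nit"]),
  description: z.string(),  // what the reviewer should look for, in plain language
  examples: z.array(z.string()).optional(),
});

export type PolicyRule = z.infer<typeof PolicyRule>;

// The policy file lives in the repo; adding a rule is a PR to this file, not a code change.
export function loadPolicy(path = "review-policy.yaml"): PolicyRule[] {
  return z.array(PolicyRule).parse(parse(readFileSync(path, "utf8")));
}

// Rules render straight into the policy pass prompt so the model cites rule ids.
export function policyPromptSection(rules: PolicyRule[]): string {
  return rules.map(r => `- [${r.id}] (${r.severity}) ${r.description}`).join("\n");
}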

What's a realistic timeline to get this running in production?

For a single repo with a clean CI pipeline, 2-3 weeks: week one to instrument and run shadow mode against historical PRs, week two to tune precision and write the policy rules, week three to enable blocking checks. For an org with 50+ repos and varied stacks, plan 8-12 weeks including rollout, training, and a feedback dashboard. We do not recommend turning on blocking checks before two weeks of shadow data.

How much does the LLM usage actually cost per month?

Depends on PR volume and model choice. A team doing 200 PRs a week with average diffs of 400 lines, running three review passes on Claude Sonnet, lands around $400-900/month in API costs. Switching the policy pass to a smaller model like Haiku or a self-hosted Qwen drops that 40-60%. We instrument token usage per repo and per pass so you can see exactly where spend goes.
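
A back-of-envelope estimator you can plug your own numbers into; every field below is a placeholder lever, and the per-million-token prices are whatever your model and contract actually charge, not quoted rates.

// cost.ts — rough monthly spend estimator; all inputs are placeholders for your own numbers
interface CostInputs {
  prsPerWeek: number;
  avgDiffLines: number;
  passes: number;               // e.g. security + correctness + policy = 3
  inputTokensPerLine: number;   // diff plus blast-radius context, per diff line
  outputTokensPerPass: number;
  pricePerMTokIn: number;       // USD per million input tokens for your model
  pricePerMTokOut: number;      // USD per million output tokens
}

export function monthlyCost(c: CostInputs): number {
  const prsPerMonth = c.prsPerWeek * 4.33;
  const inputTokens = prsPerMonth * c.passes * c.avgDiffLines * c.inputTokensPerLine;
  const outputTokens = prsPerMonth * c.passes * c.outputTokensPerPass;
  return (inputTokens / 1e6) * c.pricePerMTokIn + (outputTokens / 1e6) * c.pricePerMTokOut;
}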

Is this approach approved for federal or DoD use?

It can be. As an SDVOSB, we deploy this pattern for federal clients using FedRAMP-authorized model endpoints (Bedrock GovCloud, Azure Government OpenAI) or fully air-gapped open-weight models on accredited infrastructure. The reviewer itself runs in your boundary, logs to your SIEM, and respects your existing ATO. We have shipped this against IL4 and IL5 environments and can speak to the control mappings during scoping.

[ NEXT ACTION ]

Ship an AI code reviewer your engineers won't mute in week two.

Talk to a VooStack operator. We respond within one business day.