AI surveillance that survives an audit — not a flag-everything firehose.
We build communications and transaction monitoring on Claude, GPT-4, and open-weight models with calibrated false-positive rates, full audit trails, and a reviewer queue your compliance team will actually use across Slack, Teams, email, Salesforce, and core banking systems.
Most AI compliance pilots flag everything, prove nothing, and get shelved after the first audit.
The pattern is predictable. A vendor sells an LLM-based surveillance tool. It's pointed at Slack, Teams, or email and starts firing thousands of alerts a week. Compliance reviewers triage 200 a day, dismiss 95% as noise, and stop trusting the queue by month two. When examiners ask how a specific decision was made — which policy, which model version, which reviewer, what rationale — nobody can reconstruct it. The tool gets quietly turned off and the firm goes back to keyword lexicons from 2014.
- ▸ LLM flags every mention of "guarantee" or "insider" with no context, burying real violations in noise reviewers learn to ignore.
- ▸ No versioning of prompts, policies, or models — when an examiner asks why an item was dismissed in March, nobody can answer.
- ▸ Reviewer dispositions don't feed back into the system, so the same false-positive pattern fires every week for a year.
- ▸ Audit log is a CloudWatch dump, not an immutable record tied to policy version, model version, and named reviewer.
Build the review queue first. The model is the easy part.
- STEP-01
Map the policy to evidence
Sit with compliance officers and translate each rule (FINRA 2210, HIPAA 164.502, Title VII harassment, AML typologies) into observable signals in Slack, Teams, email, Salesforce notes, or transaction logs. No signal, no rule. Document the mapping in a policy registry that the model references at inference time.
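A minimal sketch of what one registry entry might look like, assuming a Python dict keyed by policy ID; the clause text, lexicon terms, channel names, and git SHA below are illustrative placeholders, not a real registry:

```python
# Hypothetical policy-registry entry: each rule maps to observable signals.
# All values are illustrative; a real registry lives in version control.
POLICY_REGISTRY = {
    "FINRA-2210-promissory": {
        "version": "a1b2c3d",  # git SHA of the registry commit
        "clause_text": "Communications may not predict or project performance.",
        "channels": ["email", "slack"],
        "signals": {
            "lexicon": ["guarantee", "can't lose", "risk-free"],
            "llm_rubric": "Does the message promise or imply a specific return?",
        },
        "effective_date": "2024-01-15",
    },
}

def rules_for_channel(channel: str) -> list[str]:
    """Return the policy IDs whose mapped signals apply to a given channel."""
    return [pid for pid, rule in POLICY_REGISTRY.items()
            if channel in rule["channels"]]
```

The "no signal, no rule" test falls out of the structure: a rule with an empty `signals` block has nothing to fire on, so it cannot be enforced and should not be in the registry.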
- STEP-02
Two-stage detection: cheap then smart
Stage one is deterministic — regex, lexicons, amount thresholds, sanctions list joins. Stage two routes only the survivors to an LLM with a structured rubric and the relevant policy text in context. This cuts inference cost 80-95% and gives you a defensible rules layer when auditors ask why something fired.
- STEP-03
Calibrate false positives explicitly
We tune for precision, not recall theater. Each rule gets a target FP rate (typically 5-15%), measured weekly against reviewer dispositions. Rules that exceed budget get re-prompted, re-scoped, or retired. Reviewers tag every dismissal with a reason code that feeds back into prompt revisions and few-shot examples.
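A sketch of the weekly calibration, assuming reviewer dispositions arrive as (policy_id, disposition) pairs and a dismissal counts as a false positive; the 15% default budget is illustrative:

```python
from collections import Counter

# Illustrative per-rule FP budgets; unlisted rules fall back to the default.
FP_BUDGET = {"FINRA-2210-promissory": 0.15}
DEFAULT_BUDGET = 0.15

def weekly_fp_rate(dispositions: list[tuple[str, str]]) -> dict[str, float]:
    """Compute per-rule false-positive rate from reviewer dispositions.
    A 'dismiss' counts as a false positive; anything else as a hit."""
    totals, dismissals = Counter(), Counter()
    for policy_id, disp in dispositions:
        totals[policy_id] += 1
        if disp == "dismiss":
            dismissals[policy_id] += 1
    return {pid: dismissals[pid] / totals[pid] for pid in totals}

def over_budget(rates: dict[str, float]) -> list[str]:
    """Rules exceeding their FP budget get re-prompted, re-scoped, or retired."""
    return [pid for pid, rate in rates.items()
            if rate > FP_BUDGET.get(pid, DEFAULT_BUDGET)]
```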
- STEP-04
Reviewer workflow with full context
Flagged items land in a queue (custom UI or HubSpot/ServiceNow/Jira, depending on your stack) with the original message, surrounding thread, matched policy clause, model rationale, and one-click disposition: escalate, dismiss, request-info. Every action writes to an append-only audit log with reviewer ID, timestamp, and justification.
- STEP-05
Audit trail and model attestation
Every decision — model version, prompt hash, policy version, input hash, output, reviewer override — is written to immutable storage (S3 Object Lock or equivalent). When examiners arrive, you reproduce any decision from 18 months ago in under a minute. This is the deliverable that actually matters.
from dataclasses import dataclass, asdict
from datetime import datetime, timedelta, timezone
import hashlib
import json

import boto3

s3 = boto3.client("s3")


@dataclass
class FlagDecision:
    item_id: str
    policy_id: str                  # e.g. "FINRA-2210-promissory"
    policy_version: str             # git SHA of the policy registry
    model: str                      # e.g. "claude-3-5-sonnet-20241022"
    prompt_hash: str                # sha256 of the rendered prompt
    severity: str                   # low | medium | high
    rationale: str                  # model's structured explanation
    matched_spans: list             # [(start, end, clause_id), ...]
    reviewer_id: str | None = None
    disposition: str | None = None  # escalate | dismiss | request-info
    dismiss_reason: str | None = None
    decided_at: datetime | None = None


def seven_years_out() -> datetime:
    # Approximate seven-year retention horizon; production code computes
    # the exact date required by the applicable retention rule.
    return datetime.now(timezone.utc) + timedelta(days=7 * 365)


def write_audit(decision: FlagDecision, raw_input: str) -> None:
    record = {
        **asdict(decision),
        "input_sha256": hashlib.sha256(raw_input.encode()).hexdigest(),
        "written_at": datetime.now(timezone.utc).isoformat(),
    }
    # S3 Object Lock bucket in compliance retention mode: the record cannot
    # be overwritten or deleted until the retain-until date passes.
    s3.put_object(
        Bucket="voostack-compliance-audit",
        Key=f"flags/{decision.item_id}.json",
        Body=json.dumps(record, default=str),
        ObjectLockMode="COMPLIANCE",
        ObjectLockRetainUntilDate=seven_years_out(),
    )

The audit record, not the model output, is the regulatory artifact. Design this schema before you write a single prompt.
Field FAQ.
→ How do you keep false positives from drowning the review team?
We set an explicit precision target per rule — usually 85-95% — and measure it weekly against reviewer dispositions. Rules that exceed the FP budget get re-scoped: tighter prompts, additional pre-filters, narrower context windows, or retirement. We also separate severity tiers so a low-confidence flag goes to a triage queue instead of paging a senior reviewer at 2am. The goal is not zero false positives; it's a defensible, measured rate that reviewers can sustain.
→ Which regulations have you built monitoring for?
Common ones: FINRA 2210 and 3110 for broker-dealer communications, SEC Marketing Rule, HIPAA 164.502 for PHI in support tickets and email, AML/BSA transaction typologies, GDPR data subject mentions, Title VII and harassment policy in HR channels, and internal policies like insider trading windows or MNPI handling. The pattern is the same — translate the rule into observable signals, then layer deterministic checks before any LLM call.
→ Can the model output be used as evidence in a regulatory exam?
The model output alone, no. The audit record is the evidence: input hash, prompt hash, model version, policy version, rationale, matched policy clauses, reviewer disposition, and timestamp — all written to immutable storage with retention locks. Examiners care about reproducibility and human accountability. We design so any flag from 18+ months ago can be reconstructed and explained by a named human reviewer, not a black box.
→ Do you send our communications data to OpenAI or Anthropic?
Depends on your data classification. For most regulated workloads we use Azure OpenAI, Bedrock (Claude), or Vertex with zero-retention agreements and data residency in your region. For the most sensitive channels — PHI, classified-adjacent, or attorney-client — we deploy open-weight models (Llama 3.1, Qwen) on your own VPC or on-prem GPUs. The architecture is the same; only the inference endpoint changes.
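A sketch of that routing decision, with hypothetical endpoint names and simplified model labels; in practice the classification-to-backend mapping comes from your data-governance policy, not code:

```python
# Route inference by data classification. The pipeline is identical either
# way; only the endpoint changes. All names below are illustrative.
SENSITIVE = {"phi", "attorney-client", "classified-adjacent"}

def inference_target(classification: str) -> dict:
    """Pick an inference backend for a given data classification."""
    if classification in SENSITIVE:
        # Open-weight model on your own VPC or on-prem GPUs.
        return {"backend": "self-hosted", "model": "llama-3.1-70b",
                "endpoint": "https://inference.corp.internal/v1"}
    # Managed endpoint with a zero-retention agreement and regional residency.
    return {"backend": "bedrock", "model": "claude-3-5-sonnet",
            "region": "us-east-1"}
```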
→ How long does a first deployment take?
A focused pilot on one channel and 3-5 policies typically ships in 4-6 weeks: two weeks on policy mapping and rule design, two weeks building the detection pipeline and reviewer queue, one to two weeks tuning against historical data. Production rollout to additional channels and policies is incremental from there. We don't sell 9-month transformations — if it can't show value in a quarter, the scope is wrong.
→ What does the reviewer interface actually look like?
A queue grouped by severity, with each item showing the flagged message, surrounding thread context, the specific policy clause matched, the model's structured rationale, and three buttons: escalate, dismiss with reason, request more info. We usually build it as a thin custom UI, but we've also embedded into ServiceNow, Jira, and Salesforce Service Cloud when clients want to keep reviewers in their existing tool.
→ Is this an SDVOSB-eligible engagement for federal work?
Yes. VooStack is SDVOSB-certified and eligible for sole-source awards up to the SDVOSB threshold and set-aside competitions. We've shipped compliance monitoring patterns under FedRAMP-aligned environments and can work in IL4/IL5 boundaries with the right partner cloud. If you're a federal agency or a prime looking for an SDVOSB sub on a compliance or insider-threat scope, contracting is straightforward.
→ How do you handle model drift and policy changes?
Both are versioned in git. Policies live in a registry where each clause has an ID, version, and effective date; prompts reference clause IDs, not free text. When a policy changes, the registry bumps, prompts re-render, and a regression suite runs against a labeled gold set before promotion. Model drift is caught by the weekly precision/recall report — if numbers move, we investigate before reviewers feel it.
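The promotion gate can be sketched as a function over a labeled gold set; here `predict` stands in for the rendered prompt plus model call, and the precision/recall thresholds are illustrative:

```python
from typing import Callable

def regression_gate(predict: Callable[[str], bool],
                    gold_set: list[tuple[str, bool]],
                    min_precision: float = 0.85,
                    min_recall: float = 0.70) -> bool:
    """Run a candidate prompt/policy version against a labeled gold set
    before promotion. Returns True only if both thresholds are met."""
    tp = fp = fn = 0
    for text, should_flag in gold_set:
        flagged = predict(text)
        if flagged and should_flag:
            tp += 1
        elif flagged and not should_flag:
            fp += 1
        elif not flagged and should_flag:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision >= min_precision and recall >= min_recall
```

A version bump that fails the gate never reaches reviewers, which is what keeps a policy edit from silently reintroducing last quarter's false-positive pattern.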
→ Can we start with one channel and expand?
That's the recommended path. Pick the channel with the highest regulatory exposure or the loudest reviewer pain — usually email surveillance, Slack/Teams for HR, or transaction narrative review for AML. Get the detection, queue, and audit trail working there. Once reviewers trust the system and you have weekly precision metrics, adding the next channel is mostly configuration plus a new policy mapping, not a rebuild.
Continue recon.
AI Integration Services
How we wire Claude, GPT, and RAG into regulated workflows without breaking audit posture.
REL-02 Shipped Engagements
Concrete examples of monitoring, modernization, and AI integration we've delivered.
REL-03 Fixed-Scope Packages
Defined-price pilots for compliance monitoring and AI workflow integration.
REL-04 Talk To An Engineer
Skip the sales call. Get a 30-minute scoping conversation with someone who's shipped this.
Have a compliance backlog and a model that flags everything? Let's fix the precision problem.
Talk to a VooStack operator. We respond within one business day.