LLMs in regulated finance, built to survive an SR 11-7 review.
We integrate Claude, GPT, and Bedrock-hosted models into bank, asset manager, and insurer workflows with the model risk documentation, inference audit trail, and vendor assessments your regulator and internal MRM team will actually accept.
Most LLM pilots in finance die at the model risk committee, not in engineering.
The pattern is consistent. A line of business spins up a Claude or GPT proof-of-concept in two weeks. It demos well. Then it hits model risk management, third-party risk, internal audit, and compliance — and there's no model inventory entry, no validation evidence, no inference logging, no vendor assessment of the foundation model provider, and no documented distinction between augmentation and automated decisioning. Six months later the project is shelved or quietly running in shadow IT. The technology wasn't the problem. The governance scaffolding around it was never built.
- Prompts and outputs aren't logged in a way that lets auditors reconstruct a decision two years later under examination.
- Foundation model vendors haven't been through TPRM, so SOC 2, data residency, and sub-processor questions are unanswered.
- No documented threshold for when LLM output requires human review versus auto-action under ECOA or state AI bulletins.
- Model validation, drift monitoring, and revalidation cadence don't exist, so the system fails its first MRM annual review.
Treat the model like any other production risk: governed, logged, reversible.
- STEP-01
Map the decision surface first
Before any model call, we document which decisions the LLM touches and classify each as augmentation (human-in-loop) or automated decisioning. Automated paths trigger SR 11-7, ECOA, and fair-lending review. Most use cases stay augmentation-only on purpose.
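A minimal sketch of what a decision-surface entry might look like. The class names, fields, and `UC-0042` identifier are illustrative assumptions, not a fixed schema:

```python
# Hypothetical decision-surface entry, documented before any model call.
# Field names and the use-case ID are illustrative, not a real registry format.
from dataclasses import dataclass
from enum import Enum

class DecisionClass(Enum):
    AUGMENTATION = "augmentation"              # human reviews and can override
    AUTOMATED_DECISION = "automated_decision"  # output flows directly to action

@dataclass(frozen=True)
class DecisionSurfaceEntry:
    use_case_id: str
    description: str
    decision_class: DecisionClass
    affected_party: str  # e.g. "customer", "counterparty", "internal"

    def requires_fair_lending_review(self) -> bool:
        # Automated paths trigger SR 11-7, ECOA, and fair-lending review.
        return self.decision_class is DecisionClass.AUTOMATED_DECISION

entry = DecisionSurfaceEntry(
    use_case_id="UC-0042",
    description="Summarize commercial loan packages for credit officers",
    decision_class=DecisionClass.AUGMENTATION,
    affected_party="internal",
)
```

Classifying every use case up front makes "augmentation-only on purpose" an auditable property rather than a slide-deck claim.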
- STEP-02
Build the inference audit trail
Every prompt, retrieved context chunk, model version, temperature, token count, and output gets written to an append-only store (typically S3 + Glue or Snowflake) with a hash chain. Auditors can replay any decision 7 years later, byte-for-byte.
- STEP-03
Wire explainability into the response
RAG citations are mandatory — outputs link to source paragraphs in the policy doc, filing, or claim record. For scoring tasks we pair the LLM with a SHAP-explainable gradient-boosted model so the regulator-facing rationale isn't just 'the LLM said so.'
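A toy sketch of how mandatory citations can be enforced at the response boundary. `SourceChunk`, `CitedAnswer`, and `require_citations` are invented names for illustration, not a real API:

```python
# Illustrative gate: an answer with no source citations never leaves the system.
from dataclasses import dataclass

@dataclass
class SourceChunk:
    doc_id: str      # policy doc, filing, or claim record identifier
    paragraph: int   # location of the cited passage
    text: str

@dataclass
class CitedAnswer:
    answer: str
    citations: list[SourceChunk]

def require_citations(answer: str, chunks: list[SourceChunk]) -> CitedAnswer:
    # Reject uncited output before it reaches the user, not after.
    if not chunks:
        raise ValueError("refusing to return an answer with no source citations")
    return CitedAnswer(answer=answer, citations=chunks)
```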
- STEP-04
Vendor risk on the foundation model
We produce the third-party risk package your TPRM team actually needs: SOC 2 Type II for Anthropic/OpenAI/Bedrock, data residency attestations, zero-retention API configuration evidence, sub-processor lists, and a documented exit plan to a second provider behind an abstraction layer.
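The "thin abstraction" behind the exit plan can be as small as an interface and a factory. The provider classes below are stand-ins; real implementations would wrap the vendor SDK calls:

```python
# Minimal sketch of a provider abstraction that makes the exit plan credible.
# Provider names and the complete() signature are illustrative assumptions.
from abc import ABC, abstractmethod

class ChatProvider(ABC):
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class PrimaryProvider(ChatProvider):
    def complete(self, prompt: str) -> str:
        return f"[primary] {prompt}"    # real vendor SDK call goes here

class SecondaryProvider(ChatProvider):
    def complete(self, prompt: str) -> str:
        return f"[secondary] {prompt}"  # the documented exit path

def get_provider(name: str) -> ChatProvider:
    # Swapping vendors becomes a config change, not a rewrite,
    # so vendor concentration doesn't become a TPRM finding.
    return {"primary": PrimaryProvider, "secondary": SecondaryProvider}[name]()
```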
- STEP-05
Validate, monitor, re-validate
Independent model validation before launch (challenger tests, adversarial prompts, bias slices), then drift monitoring on output distributions, refusal rates, and citation accuracy. Annual revalidation is scheduled, not optional, and findings route to the model risk committee.
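One of the drift checks above, sketched concretely: refusal-rate monitoring against the baseline established at validation. The 0.05 tolerance is a placeholder, not a recommendation:

```python
# Hedged sketch of refusal-rate drift detection over a window of inference
# records. Threshold values here are illustrative placeholders.
def refusal_rate(records: list[dict]) -> float:
    if not records:
        return 0.0
    return sum(1 for r in records if r["refusal"]) / len(records)

def refusal_drift_alert(window: list[dict], baseline: float, tolerance: float = 0.05) -> bool:
    # A sustained move beyond tolerance routes a finding to the model risk committee.
    return abs(refusal_rate(window) - baseline) > tolerance
```

The same pattern extends to citation-accuracy and output-distribution checks; the point is that each monitored metric has a validated baseline and a documented alert threshold.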
# Inference audit record — written before the response is returned to the user.
# This is what your model risk and internal audit teams will ask for in year 3.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import hashlib, json, uuid

@dataclass
class InferenceRecord:
    inference_id: str
    decision_class: str           # "augmentation" | "automated_decision"
    use_case_id: str              # registered in model inventory
    model_provider: str           # e.g. "anthropic", "bedrock-anthropic"
    model_version: str            # pinned, never "latest"
    prompt_template_hash: str
    retrieved_doc_ids: list[str]  # RAG citations, ordered
    user_id: str
    user_role: str
    input_redacted: str           # PII scrubbed pre-log
    output: str
    refusal: bool
    latency_ms: int
    cost_usd: float
    prev_record_hash: str         # append-only chain
    created_at: str

    def content_hash(self) -> str:
        return hashlib.sha256(
            json.dumps(asdict(self), sort_keys=True).encode()
        ).hexdigest()

def record(prev_hash: str, **kwargs) -> InferenceRecord:
    return InferenceRecord(
        inference_id=str(uuid.uuid4()),
        prev_record_hash=prev_hash,
        created_at=datetime.now(timezone.utc).isoformat(),
        **kwargs,
    )

An append-only inference log with a hash chain is the single artifact that turns an LLM from a compliance liability into a defensible production system.
Field FAQ.
→ Does SR 11-7 actually apply to LLMs and RAG systems?
Yes, in almost every case we've seen at banks supervised by the Fed or OCC. SR 11-7 defines a model broadly as any quantitative method that produces output used in decision-making. An LLM summarizing a credit memo, classifying a complaint, or drafting a suitability rationale meets that bar. The relevant question isn't whether it applies — it's whether the use case is high, medium, or low materiality, which determines validation depth, documentation, and revalidation cadence.
→ What's the practical line between augmentation and automated decisioning?
If a human reviews and can override the output before it affects a customer, account, or filing, it's augmentation. If the model's output flows directly to action — auto-denying a claim, auto-pricing a policy, auto-routing a trade — it's automated decisioning, and you inherit ECOA adverse-action notice rules, state insurance AI bulletins (Colorado, NY DFS), and far heavier validation. We design most deployments to stay firmly on the augmentation side.
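The line between the two classes can be expressed as a dispatch gate. This is a toy sketch; the function name and status strings are invented for illustration:

```python
# Toy sketch of the augmentation gate: nothing reaches a customer, account,
# or filing without human signoff unless the use case is explicitly
# classified as automated decisioning. Names are illustrative.
def dispatch(output: str, decision_class: str, human_approved: bool) -> str:
    if decision_class == "automated_decision":
        # This branch inherits ECOA adverse-action rules and state AI
        # bulletins; most deployments are designed to avoid it entirely.
        return "queued_for_automated_action"
    # Augmentation: the human can review and override before anything lands.
    return "actioned" if human_approved else "held_for_review"
```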
→ How do you handle explainability when the model is a black box?
We don't try to explain the transformer's internals — that's a losing fight. Instead we make the system explainable: every output cites the retrieved documents it used, the prompt template is versioned and reviewable, and for any quantitative score we pair the LLM with a separate interpretable model (logistic regression or GBM with SHAP) that produces the regulator-facing rationale. The LLM drafts; the explainable model justifies.
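The "explainable model justifies" half of that pairing can be as simple as a model whose per-feature contributions are the rationale. The sketch below uses a hand-specified logistic model with invented feature names and weights, standing in for a validated logistic regression or SHAP-explained GBM:

```python
# Illustrative pairing, not a production scorer: the interpretable model's
# signed per-feature contributions are the regulator-facing rationale.
# Feature names and coefficients are invented for this sketch.
import math

COEFFS = {"debt_to_income": -2.1, "years_in_business": 0.4, "intercept": 0.3}

def score_with_rationale(features: dict[str, float]) -> tuple[float, list[str]]:
    z = COEFFS["intercept"] + sum(COEFFS[k] * v for k, v in features.items())
    prob = 1 / (1 + math.exp(-z))
    # Each term's signed contribution is the audit-ready explanation;
    # the LLM separately drafts the narrative around it.
    rationale = [f"{k}: contribution {COEFFS[k] * v:+.2f}" for k, v in features.items()]
    return prob, rationale
```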
→ What does vendor risk assessment look like for Anthropic, OpenAI, or Bedrock?
Standard TPRM package plus AI-specific controls. We collect SOC 2 Type II reports, ISO 27001 certs, data processing agreements with zero-retention configured and evidenced, sub-processor lists, model training data attestations, and incident notification SLAs. For Bedrock and Azure OpenAI you also get the cloud provider's existing assessment, which shortens the cycle. We document an exit plan to a second provider behind a thin abstraction so vendor concentration doesn't become a finding.
→ Can the foundation model provider see our customer data?
Configured correctly, no. Anthropic, OpenAI enterprise, Bedrock, and Azure OpenAI all support zero-retention modes where prompts and outputs are not logged on their side and not used for training. We verify this in writing in the contract, configure it at the API level, and test it with canary data. For highest-sensitivity workloads we recommend Bedrock or Azure OpenAI inside your existing cloud tenancy so data never leaves your VPC boundary.
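The canary-data test mentioned above amounts to seeding an unguessable marker through the API and confirming it never surfaces anywhere the vendor or platform lets you inspect. A stdlib-only sketch of the two halves, with invented function names:

```python
# Hedged sketch of a zero-retention canary test. make_canary and
# canary_leaked are illustrative names; the "stores" here stand in for
# text dumped from whatever logs or consoles the vendor exposes to you.
import uuid

def make_canary() -> str:
    # Unique token that should never appear outside your own audit log.
    return f"CANARY-{uuid.uuid4().hex}"

def canary_leaked(canary: str, searchable_stores: list[str]) -> bool:
    return any(canary in blob for blob in searchable_stores)
```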
→ How long does a compliant deployment actually take?
For a single well-scoped use case — say, summarizing commercial loan packages for credit officers — six to ten weeks from kickoff to production, assuming model risk management is engaged from week one. Half that time is documentation, validation, and TPRM, not code. Teams who skip the governance work ship in three weeks and then spend six months unwinding it after the first audit. We've cleaned up enough of those to know which path is faster end-to-end.
→ What use cases are working in production at banks and insurers right now?
Document-heavy back-office work: KYC narrative drafting, suitability memo generation, claims first-notice-of-loss summarization, complaint classification and routing, policy document Q&A for service reps, internal audit evidence retrieval, and code modernization for legacy COBOL or PL/SQL. Customer-facing chat exists but is heavily constrained. Anything touching credit decisions, underwriting outcomes, or trade execution stays augmentation with a human signoff — by design, not by limitation.
→ Does veteran-owned / SDVOSB status matter for commercial financial services work?
For commercial banks and insurers, SDVOSB status is mostly irrelevant to the procurement decision — they care about the work and the references. Where it matters: federal financial regulators (OCC, FDIC, Treasury, FHFA), GSEs, and any bank pursuing federal contract subcontracting credit can route work to us under set-aside vehicles. If you're a commercial institution with a federal-facing subsidiary or government contracts, we can support both sides under one engagement.
→ Who owns the model risk documentation when the engagement ends?
You do, fully. Model development documents, validation reports, monitoring playbooks, prompt templates, eval datasets, and the inference audit schema are all delivered as artifacts in your repos and your model inventory system — not ours. We've handed off to internal MRM teams, third-party validators, and successor consultancies without issue. If your model risk officer can't run the system without us after go-live, we built it wrong.
Continue recon.
AI integration services
How we scope, build, and govern LLM deployments inside regulated environments.
REL-02 Past engagements
Selected examples of modernization and AI work shipped under audit and compliance constraints.
REL-03 Fixed-scope packages
Pre-priced engagements for RAG pilots, model risk documentation, and TPRM packages.
REL-04 Talk to an engineer
Bring a specific use case. We'll tell you whether it clears MRM in one call.
Bring us your model risk committee's hardest questions. We've answered them before.
Talk to a VooStack operator. We respond within one business day.