[ AI INTEGRATION ] // FINANCIAL SERVICES

LLMs in regulated finance, built to survive an SR 11-7 review.

We integrate Claude, GPT, and Bedrock-hosted models into bank, asset manager, and insurer workflows with the model risk documentation, inference audit trail, and vendor assessments your regulator and internal MRM team will actually accept.

Veteran-Owned SDVOSB
[001 / 005] Field Conditions

Most LLM pilots in finance die at the model risk committee, not in engineering.

// SITUATION

The pattern is consistent. A line of business spins up a Claude or GPT proof-of-concept in two weeks. It demos well. Then it hits model risk management, third-party risk, internal audit, and compliance — and there's no model inventory entry, no validation evidence, no inference logging, no vendor assessment of the foundation model provider, and no documented distinction between augmentation and automated decisioning. Six months later the project is shelved or quietly running in shadow IT. The technology wasn't the problem. The governance scaffolding around it was never built.

  • Prompts and outputs aren't logged in a way that lets auditors reconstruct a decision two years later under examination.
  • Foundation model vendors haven't been through TPRM, so SOC 2, data residency, and sub-processor questions are unanswered.
  • No documented threshold for when LLM output requires human review versus auto-action under ECOA or state AI bulletins.
  • Model validation, drift monitoring, and revalidation cadence don't exist, so the system fails its first MRM annual review.
SR 11-7
Aligned model documentation from day one
7 yr
Inference audit trail retention by default
< 10 wks
Typical time to production for a governed use case
[002 / 005] Operational Approach

Treat the model like any other production risk: governed, logged, reversible.

  1. STEP-01

    Map the decision surface first

    Before any model call, we document which decisions the LLM touches and classify each as augmentation (human-in-the-loop) or automated decisioning. Automated paths trigger SR 11-7, ECOA, and fair-lending review. Most use cases stay augmentation-only on purpose.

  2. STEP-02

    Build the inference audit trail

    Every prompt, retrieved context chunk, model version, temperature, token count, and output gets written to an append-only store (typically S3 + Glue or Snowflake) with a hash chain. Auditors can replay any decision 7 years later, byte-for-byte.

  3. STEP-03

    Wire explainability into the response

    RAG citations are mandatory — outputs link to source paragraphs in the policy doc, filing, or claim record. For scoring tasks we pair the LLM with a SHAP-explainable gradient-boosted model so the regulator-facing rationale isn't just 'the LLM said so.' A minimal citation-gate sketch follows this list.

  4. STEP-04

    Vendor risk on the foundation model

    We produce the third-party risk package your TPRM team actually needs: SOC 2 Type II for Anthropic/OpenAI/Bedrock, data residency attestations, zero-retention API configuration evidence, sub-processor lists, and a documented exit plan to a second provider behind an abstraction layer.

  5. STEP-05

    Validate, monitor, re-validate

    Independent model validation before launch (challenger tests, adversarial prompts, bias slices), then drift monitoring on output distributions, refusal rates, and citation accuracy. Annual revalidation is scheduled, not optional, and findings route to the model risk committee. A drift-check sketch follows the Python pattern below.
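
A minimal sketch of the STEP-03 citation gate, assuming a RAG pipeline that returns the retrieved chunk IDs alongside the draft output; the exception type and the resolve-every-citation rule are illustrative, not the only way to enforce it.

# Citation gate: an output that cannot point at its sources never reaches the user.
class UncitedOutputError(Exception):
    pass

def gate_output(output: str, retrieved_doc_ids: list[str], cited_doc_ids: list[str]) -> str:
    # Every cited ID must resolve to a chunk that was actually retrieved for this call.
    unresolved = [d for d in cited_doc_ids if d not in retrieved_doc_ids]
    if not cited_doc_ids or unresolved:
        # Blocked outputs route to human review and are logged as refusals, not returned.
        raise UncitedOutputError(f"uncited or unresolved citations: {unresolved or 'none cited'}")
    return output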

// PYTHON PATTERN
# Inference audit record — written before the response is returned to the user.
# This is what your model risk and internal audit teams will ask for in year 3.

from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import hashlib, json, uuid

@dataclass
class InferenceRecord:
    inference_id: str
    decision_class: str          # "augmentation" | "automated_decision"
    use_case_id: str             # registered in model inventory
    model_provider: str          # e.g. "anthropic", "bedrock-anthropic"
    model_version: str           # pinned, never "latest"
    prompt_template_hash: str
    retrieved_doc_ids: list[str] # RAG citations, ordered
    user_id: str
    user_role: str
    input_redacted: str          # PII scrubbed pre-log
    output: str
    refusal: bool
    latency_ms: int
    cost_usd: float
    prev_record_hash: str        # append-only chain
    created_at: str

    def content_hash(self) -> str:
        return hashlib.sha256(json.dumps(asdict(self), sort_keys=True).encode()).hexdigest()

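# prev_hash is assumed to be the previous record's content_hash(), which is what
# continues the append-only chain; immutability is enforced by the store, not this helper.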
def record(prev_hash: str, **kwargs) -> InferenceRecord:
    return InferenceRecord(
        inference_id=str(uuid.uuid4()),
        prev_record_hash=prev_hash,
        created_at=datetime.now(timezone.utc).isoformat(),
        **kwargs,
    )

An append-only inference log with a hash chain is the single artifact that turns an LLM from a compliance liability into a defensible production system.
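
A companion sketch for STEP-05, assuming records shaped like the InferenceRecord above; the window, thresholds, and the choice to treat an output without retrieved citations as uncited are illustrative and get set per use case with your MRM team.

# Drift check over a rolling window of InferenceRecord rows; thresholds are illustrative.
def drift_findings(records: list[InferenceRecord],
                   baseline_refusal_rate: float,
                   max_refusal_delta: float = 0.05,
                   min_citation_rate: float = 0.95) -> list[str]:
    findings = []
    total = len(records)
    refusal_rate = sum(r.refusal for r in records) / total
    citation_rate = sum(bool(r.retrieved_doc_ids) for r in records) / total
    if refusal_rate - baseline_refusal_rate > max_refusal_delta:
        findings.append(f"refusal rate drifted to {refusal_rate:.1%}")
    if citation_rate < min_citation_rate:
        findings.append(f"citation coverage fell to {citation_rate:.1%}")
    return findings  # a non-empty list routes to the model risk committee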

[003 / 005] Common Questions

Field FAQ.

Does SR 11-7 actually apply to LLMs and RAG systems?

Yes, in almost every case we've seen at banks supervised by the Fed or OCC. SR 11-7 defines a model broadly as any quantitative method that produces output used in decision-making. An LLM summarizing a credit memo, classifying a complaint, or drafting a suitability rationale meets that bar. The relevant question isn't whether it applies — it's whether the use case is high, medium, or low materiality, which determines validation depth, documentation, and revalidation cadence.

What's the practical line between augmentation and automated decisioning?

If a human reviews and can override the output before it affects a customer, account, or filing, it's augmentation. If the model's output flows directly to action — auto-denying a claim, auto-pricing a policy, auto-routing a trade — it's automated decisioning, and you inherit ECOA adverse-action notice rules, state insurance AI bulletins (Colorado, NY DFS), and far heavier validation. We design most deployments to stay firmly on the augmentation side.
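
A minimal sketch of how that line shows up in code, reusing the decision_class values from the inference record pattern above; the routing targets are illustrative names, not a fixed pipeline.

# Routing sketch: automated_decision paths never act without the heavier control set.
def route(decision_class: str) -> str:
    if decision_class == "augmentation":
        # A reviewer can override before anything reaches a customer, account, or filing.
        return "human_review_queue"
    if decision_class == "automated_decision":
        # Inherits ECOA adverse-action notices, state AI bulletin obligations,
        # and the deeper validation tier; most use cases deliberately avoid this branch.
        return "automated_action_pipeline"
    raise ValueError(f"unregistered decision class: {decision_class}")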

How do you handle explainability when the model is a black box?

We don't try to explain the transformer's internals — that's a losing fight. Instead we make the system explainable: every output cites the retrieved documents it used, the prompt template is versioned and reviewable, and for any quantitative score we pair the LLM with a separate interpretable model (logistic regression or GBM with SHAP) that produces the regulator-facing rationale. The LLM drafts; the explainable model justifies.
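
A minimal sketch of that pairing, assuming a tabular feature set and a fitted gradient-boosted classifier; the data, feature names, and model here are stand-ins, not a production configuration.

import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in data: four hypothetical features, binary outcome.
X = np.random.rand(500, 4)
y = (X[:, 0] + X[:, 2] > 1.0).astype(int)
model = GradientBoostingClassifier().fit(X, y)

# SHAP gives per-feature, signed contributions for a single case.
explainer = shap.TreeExplainer(model)
contributions = explainer.shap_values(X[:1])[0]

feature_names = ["debt_to_income", "utilization", "tenure_months", "delinquencies"]
rationale = sorted(zip(feature_names, contributions), key=lambda kv: abs(kv[1]), reverse=True)
# `rationale` is the ranked contribution list that backs the regulator-facing narrative;
# the LLM rephrases it, it never originates it.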

What does vendor risk assessment look like for Anthropic, OpenAI, or Bedrock?

Standard TPRM package plus AI-specific controls. We collect SOC 2 Type II reports, ISO 27001 certs, data processing agreements with zero-retention configured and evidenced, sub-processor lists, model training data attestations, and incident notification SLAs. For Bedrock and Azure OpenAI you also get the cloud provider's existing assessment, which shortens the cycle. We document an exit plan to a second provider behind a thin abstraction so vendor concentration doesn't become a finding.

Can the foundation model provider see our customer data?

Configured correctly, no. Anthropic, OpenAI enterprise, Bedrock, and Azure OpenAI all support zero-retention modes where prompts and outputs are not logged on their side and not used for training. We verify this in writing in the contract, configure it at the API level, and test it with canary data. For highest-sensitivity workloads we recommend Bedrock or Azure OpenAI inside your existing cloud tenancy so data never leaves your VPC boundary.

How long does a compliant deployment actually take?

For a single well-scoped use case — say, summarizing commercial loan packages for credit officers — six to ten weeks from kickoff to production, assuming model risk management is engaged from week one. Half that time is documentation, validation, and TPRM, not code. Teams who skip the governance work ship in three weeks and then spend six months unwinding it after the first audit. We've cleaned up enough of those to know which path is faster end-to-end.

What use cases are working in production at banks and insurers right now?

Document-heavy back-office work: KYC narrative drafting, suitability memo generation, claims first-notice-of-loss summarization, complaint classification and routing, policy document Q&A for service reps, internal audit evidence retrieval, and code modernization for legacy COBOL or PL/SQL. Customer-facing chat exists but is heavily constrained. Anything touching credit decisions, underwriting outcomes, or trade execution stays augmentation with a human signoff — by design, not by limitation.

Does veteran-owned / SDVOSB status matter for commercial financial services work?

For commercial banks and insurers, SDVOSB status is mostly irrelevant to the procurement decision — they care about the work and the references. Where it matters: federal financial regulators (OCC, FDIC, Treasury, FHFA), GSEs, and any bank pursuing federal contract subcontracting credit can route work to us under set-aside vehicles. If you're a commercial institution with a federal-facing subsidiary or government contracts, we can support both sides under one engagement.

Who owns the model risk documentation when the engagement ends?

You do, fully. Model development documents, validation reports, monitoring playbooks, prompt templates, eval datasets, and the inference audit schema are all delivered as artifacts in your repos and your model inventory system — not ours. We've handed off to internal MRM teams, third-party validators, and successor consultancies without issue. If your model risk officer can't run the system without us after go-live, we built it wrong.

[ NEXT ACTION ]

Bring us your model risk committee's hardest questions. We've answered them before.

Talk to a VooStack operator. We respond within one business day.