The Architecture Behind Private AI for Regulated Practices

There's a specific query that defines the problem.

A physician wants to ask: given this patient's history -- the comorbidities, the prior treatment response, the labs from last week -- what does the current literature say about escalating to X? An attorney wants to ask: given how the Harmon matter resolved, and the contract language in front of me now, where's the exposure?

Both queries are where AI would actually earn its keep. Both queries are exactly what you cannot send to UpToDate Expert AI, Westlaw CoCounsel, or any other cloud product without a compliance review most practices have never done -- and that some practices can't pass even if they try.

The wall isn't a policy problem. It's an architectural one.

What every commercial product in this space has in common

UpToDate Expert AI, Westlaw CoCounsel, LexisNexis Protégé -- structurally, they're the same product. A frontier model, hosted in a cloud data center, answering from a publisher's proprietary corpus. Send them a query and the text of that query, including anything you paste into it, leaves your building. For general questions -- what does the literature say about drug X in isolation -- that's a reasonable trade. The moment patient context or client facts enter the conversation, you've moved regulated data outside your control.

Under HIPAA, that's a disclosure event unless a Business Associate Agreement is in place and the vendor is properly configured. Under attorney-client privilege, it's a potential waiver. OCR's December 2024 Security Rule update added a sharper edge: AI systems now appear in the mandatory technology asset inventory, and shadow AI -- staff using consumer tools without IT oversight -- is an explicit enforcement category. "Assume it is occurring" is the quoted language.

The enterprise BAA versions of these tools don't resolve it. They route queries through AWS or Azure under contract. The data still moves. The compliance question is still settled by paperwork, and paperwork has exceptions, breach scenarios, and subpoenas.

There is one configuration where the compliance question is settled by physics: the model runs on hardware inside the building, and the data never crosses a network boundary.

That's what I built. Here's what's actually running.

The inference engine

ik_llama.cpp. Not vLLM, not Ollama.

vLLM is the right engine for a lot of deployments -- batched inference at scale, continuous batching, well-maintained. It's wrong for this one. Qwen3-35B-A3B is a mixture-of-experts model: 35 billion total parameters, 3.6 billion active per inference. vLLM cannot gracefully offload MoE expert layers to CPU RAM when VRAM pressure forces it. On a 24GB RTX 4090, that constraint is real. ik_llama.cpp handles the expert routing correctly and runs the model fully GPU-resident at Q4_K_M quantization without the offload problem.

The other reason is the thinking mode toggle. Qwen3 ships with native reasoning mode -- extended chain-of-thought that runs before the visible response. ik_llama.cpp exposes that toggle at runtime. What would otherwise require routing between two models (fast for simple queries, slow/reasoning for complex ones) collapses into a single model with a flag. That's a meaningful architecture simplification when you're running a managed service for a 10-attorney firm that doesn't have an ML team.

Production throughput: Q4_K_M on an RTX 4090, fully GPU-resident, approximately 100 tokens per second in non-thinking mode. Thinking mode trades speed for reasoning depth on the queries that warrant it. The dev unit -- RTX 4000 Ada, 20GB VRAM -- runs Q3_K_M at 50-70 tok/s in thinking mode. Both fit without offload.

Model: Qwen3-35B-A3B, Apache 2.0 license. The license matters. Models with restrictive commercial terms create compliance liability on top of the HIPAA liability you're already managing. Apache 2.0 is the only license that doesn't require a separate legal review before deploying in a healthcare or legal environment. Native 262K context window. The reasoning tier activation is runtime, not a separate model weight.

The retrieval layer

This is the part the commercial tools can never close, regardless of how their compliance posture evolves.

UpToDate Expert AI answers from Wolters Kluwer content. Westlaw CoCounsel answers from Thomson Reuters content. Neither has access to the firm's prior matters, the practice's patient population history, or the internal protocols built over years of clinical or legal practice. That corpus exists only inside the building. Getting AI to reason over it requires the model to be inside the building too.

The private Qdrant instance ingests that corpus -- case files, prior matters, patient records, internal documents -- and keeps it local. When a physician asks about this patient's response to a prior treatment, the retrieval layer pulls the relevant records from Qdrant and surfaces them as context for the inference call. The model reasons over the patient's actual history, not a generic population. Same structure for legal: prior matter outcomes, firm-specific contract language patterns, client history.

The combination -- private inference plus private retrieval -- is what makes the query at the top of this post possible. Neither piece alone gets you there.

The API landscape

The commercial products are UI subscriptions, not programmatic data sources at SMB pricing. Westlaw has no RAG API at a price a 10-attorney firm can justify. LexisNexis has a developer portal; it's enterprise-only. UpToDate Connect is an EHR vendor API, not an MSP API.

The workflow for those tools is: the attorney retrieves content from their existing subscription, uploads the document, and the private model reasons over it alongside the client corpus. The model's value is the reasoning layer, not API resale.

The free APIs are the programmatic backbone:

PubMed / NCBI E-utilities -- free with an API key, 10 requests per second, 36 million citations. PubMed Central full-text is moving to AWS open access in August 2026, no authentication required.

ClinicalTrials.gov -- free REST API, FHIR JSON output, 500K+ trials, modernized August 2025.

CourtListener (Free Law Project) -- free tier plus commercial licensing, 8 million opinions across all federal and state courts. Their Semantic Search API launched November 2025. MCP server is available. This fills the programmatic case law gap that no Westlaw API exists to fill.

Together, those three cover the public literature backbone for both verticals at zero marginal cost. Everything firm-specific or practice-specific lives in Qdrant.

The five gaps

These aren't feature bullets. They're the structural argument for why this configuration exists and why the commercial tools can't converge on it even if they wanted to.

The BAA gap is first because it's the one that forecloses the others. If the data has to leave the building, you're managing a cloud compliance problem indefinitely -- BAA scope, configuration audits, breach scenarios. The private stack removes that surface by removing the network crossing. OCR can audit the server room. There's nothing to audit at the cloud provider because there's no cloud provider.

The corpus gap is what makes the system useful for the queries that matter. The commercial tools know the literature. They don't know the firm. Private Qdrant RAG over the client's own document corpus is the capability that converts a sophisticated research tool into something that actually reasons about this case.

The audit trail gap is what makes the system defensible. HIPAA and malpractice defense both require documented reasoning chains. Every inference call logs locally: who asked, what the model was given, what it returned, what sources it consulted, timestamp, user attribution. That log is client property. It's producible to a regulator or opposing counsel. UpToDate logs platform access. It does not log AI reasoning.

The shadow AI gap is the one practices underestimate. Staff will use ChatGPT regardless of the policy document. OCR has already said so. The managed gateway -- one interface, classification layer routes sensitive queries to on-prem and non-sensitive queries to frontier models with logging -- replaces the policy with a system. The boundary enforces itself instead of relying on willpower.

The SMB pricing gap is why this doesn't already exist from the enterprise vendors. UpToDate is deployed at 50+ major health systems. Westlaw Advantage is $400 per seat per month. The product roadmap at those companies is not a 10-physician practice in Sherman Oaks. IT CloudLink is.

What this doesn't solve

Honest accounting: this stack doesn't give you a Westlaw-quality legal research engine or an UpToDate-quality clinical decision support engine. The commercial tools have decades of editorial curation behind them. CourtListener is extensive; it's not Westlaw. PubMed is comprehensive; it's not UpToDate Expert AI with the full Wolters Kluwer editorial layer.

What the private stack does is handle the queries the commercial tools can't touch -- the ones where patient context or client facts have to be in the conversation for the answer to be useful. For pure literature research with no PHI or privilege attached, the existing subscriptions are the right tool. The private stack sits alongside them, not instead of them.

That's the architecture. The full interactive reference -- competitive matrix, API landscape, tier hardware specs, and service data flow -- is the diagram: IT CloudLink Private LLM Service Architecture.

The non-technical version -- written for the practice owner rather than whoever built the thing -- is on the IT CloudLink blog.

IT CloudLink Private LLM Service -- the practice-facing write-up.

Chris Rondthaler is the founder of IT CloudLink and runs Rondthaler Labs. YouTube: @aiopsdude