Custom EMR Development for Healthcare
Purpose-built EMR: clinical workflows, HIPAA-aware design, FHIR interoperability, and AI-ready architecture for modern care delivery.
Large language models can read a discharge summary in milliseconds and surface insights that take a clinician hours. The catch: every line of that summary is Protected Health Information (PHI). Send it to a model that wasn't covered by a Business Associate Agreement, log it to the wrong CloudWatch group, or fine-tune on it without consent, and the cost is no longer a curiosity — it is a HIPAA breach.
De-identification is the engineering discipline that lets healthcare teams use LLMs without exposing patient identity. Done well, it unlocks safe model evaluation, vendor flexibility, and reusable training data. Done poorly, it creates a false sense of safety while quietly leaking identifiers in free text.
This guide walks through the two HIPAA standards, the architecture of a production de-identification pipeline, the tools worth knowing, and the failure modes we see most often when teams try to retrofit privacy onto an existing LLM workflow.
HIPAA recognizes two — and only two — paths to de-identified data. Picking the wrong one is the most common upstream mistake we see.
Safe Harbor: remove all 18 specified identifiers (names, geographic units smaller than a state, dates more specific than year, phone numbers, email addresses, MRNs, biometric IDs, full-face photos, etc.) and have no actual knowledge that the residual data could re-identify an individual. Mechanical, auditable, conservative.
Expert Determination: a qualified statistician applies generally accepted principles and documents that the risk of re-identification is "very small." More flexible — it preserves more analytic signal — but it requires formal documentation and periodic re-review.
A "Limited Data Set" is not de-identified — it still contains dates and geographic detail and requires a Data Use Agreement. Don't conflate the two when scoping LLM access.
A healthcare LLM pipeline should treat de-identification as a discrete stage with its own trust boundary, not as a function call buried inside the model adapter.
Ingest: all raw clinical input — HL7 v2 messages, FHIR bundles, scanned PDFs, dictated notes — lands in a PHI-class S3 bucket inside a HIPAA-eligible VPC. Encrypt at rest with a customer-managed KMS key, and hold an explicit BAA with every upstream system.
Structured payloads: strip the 18 Safe Harbor identifiers programmatically from FHIR resources and EHR exports. Generate a stable pseudonym per patient using HMAC-SHA256 with a key stored in KMS — never a plain hash of the MRN alone, which can be reversed by hashing every candidate in the small, enumerable MRN space.
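A minimal sketch of the keyed-pseudonym step. The key handle and `PT_` prefix are illustrative; in production the secret would stay in KMS and the MAC would be computed through a key-management API rather than with the raw key in application memory:

```python
import hashlib
import hmac

# Illustrative only: in production this secret stays sealed in KMS/HSM
# and is never loaded into application memory as shown here.
PSEUDONYM_KEY = b"fetched-from-kms-at-runtime"

def pseudonymize_mrn(mrn: str) -> str:
    """Stable keyed pseudonym for a patient identifier.

    A bare SHA-256 of an MRN is reversible: MRNs live in a small,
    enumerable space, so an attacker can hash every candidate and match.
    The keyed HMAC resists that as long as the key stays in the PHI zone.
    """
    digest = hmac.new(PSEUDONYM_KEY, mrn.encode("utf-8"), hashlib.sha256)
    return "PT_" + digest.hexdigest()[:16]
```

Because the pseudonym is deterministic under the key, the same patient maps to the same token across every feed, so joins still work in the de-identified zone.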
Free-text redaction: clinical notes are where most leaks happen. Run a clinical NER model (Presidio, AWS Comprehend Medical, NeuroNER, or a fine-tuned spaCy pipeline) to detect names, dates, locations, ages over 89, and embedded identifiers. Replace detected tokens with type-tagged surrogates ([NAME_1], [DATE_1]) so downstream models can still reason about document structure.
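The surrogate-substitution step can be sketched independently of any particular NER backend. The `Span` type here is an illustrative stand-in for whatever span format your detector stack emits (Presidio, Comprehend Medical, spaCy all differ); the substitution logic is the part this sketch shows:

```python
from typing import NamedTuple

class Span(NamedTuple):
    """Detected PHI span; a stand-in for your NER backend's output."""
    start: int
    end: int
    label: str  # e.g. "NAME", "DATE", "LOCATION"

def redact(text: str, spans: list[Span]) -> str:
    """Replace detected PHI spans with type-tagged surrogates.

    The same surface form gets the same surrogate ([NAME_1] every time
    "Garcia" appears), so downstream models can still track coreference.
    Assumes non-overlapping spans.
    """
    counters: dict[str, int] = {}
    seen: dict[tuple[str, str], str] = {}
    out: list[str] = []
    last = 0
    for span in sorted(spans):
        surface = text[span.start:span.end]
        key = (span.label, surface.lower())
        if key not in seen:
            counters[span.label] = counters.get(span.label, 0) + 1
            seen[key] = f"[{span.label}_{counters[span.label]}]"
        out.append(text[last:span.start])
        out.append(seen[key])
        last = span.end
    out.append(text[last:])
    return "".join(out)
```

Type-tagged surrogates with stable numbering are what let a downstream model answer "what happened between the first and second visit" without ever seeing a real name or date.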
Date shifting and generalization: a per-patient date offset (a random integer number of days, derived from the same KMS-protected secret) preserves intervals between events while making absolute dates unrecoverable. Generalize ZIP codes to their first three digits — and drop them entirely for the ~17 ZIP3 areas with populations under 20,000.
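Both transforms fit in a few lines. The key handle is again illustrative, and the sparse-ZIP3 set below is the commonly published 17-prefix list derived from 2000 census data — verify it against current HHS guidance before relying on it:

```python
import hashlib
import hmac
from datetime import date, timedelta

OFFSET_KEY = b"kms-protected-secret"  # illustrative; sealed in KMS in production

# Three-digit ZIP prefixes with populations under 20,000, per the 2000
# census enumeration usually cited for Safe Harbor. Verify before use.
SPARSE_ZIP3 = {
    "036", "059", "063", "102", "203", "556", "692", "790", "821",
    "823", "830", "831", "878", "879", "884", "890", "893",
}

def date_offset_days(pseudonym: str, max_offset: int = 365) -> int:
    """Deterministic per-patient offset (1..max_offset days), derived from
    the keyed pseudonym so every record for a patient shifts identically."""
    digest = hmac.new(OFFSET_KEY, pseudonym.encode(), hashlib.sha256).digest()
    return 1 + int.from_bytes(digest[:4], "big") % max_offset

def shift_date(d: date, pseudonym: str) -> date:
    """Shift into the past by the patient's fixed offset; intervals between
    a patient's events are preserved, absolute dates are not."""
    return d - timedelta(days=date_offset_days(pseudonym))

def generalize_zip(zip5: str) -> str:
    """ZIP to first three digits; suppress entirely for sparse ZIP3 areas."""
    zip3 = zip5[:3]
    return "000" if zip3 in SPARSE_ZIP3 else zip3
```

Deriving the offset from the pseudonym rather than storing it per record means the de-identified zone needs no extra lookup table to keep a patient's timeline internally consistent.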
Verification: run a second, independent detector over the redacted output. If the secondary detector finds a residual identifier, the record is quarantined for review — never silently let through. Route a 1–2% random sample of cleared records to human review to catch detector drift.
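The gate itself is simple once the fail-closed rule is stated explicitly. The SSN regex below is a toy stand-in for a real secondary detector, which should be a different model or vendor than the primary pass so the two have different blind spots:

```python
import re
from dataclasses import dataclass, field
from typing import Callable

# A detector takes redacted text and returns any residual identifiers found.
Detector = Callable[[str], list[str]]

@dataclass
class GateResult:
    cleared: bool
    findings: list[str] = field(default_factory=list)

def verification_gate(redacted_text: str, secondary: Detector) -> GateResult:
    """Independent second pass over already-redacted output.

    Any finding quarantines the record for human review; nothing is
    silently passed through to the de-identified zone.
    """
    findings = secondary(redacted_text)
    return GateResult(cleared=not findings, findings=findings)

def toy_ssn_detector(text: str) -> list[str]:
    """Toy stand-in for a real secondary detector: SSN-shaped tokens only."""
    return re.findall(r"\b\d{3}-\d{2}-\d{4}\b", text)
```

The important design choice is that the gate returns findings rather than scrubbing them: a residual identifier at this stage means the primary pass has a blind spot worth investigating, not just a record worth patching.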
De-identified zone: cleared records move to a separate VPC, separate bucket, separate IAM boundary. This is the only zone that talks to general-purpose LLM endpoints. The crosswalk that maps pseudonyms back to MRNs lives only in the PHI zone, accessible to a narrow set of services through a documented re-identification workflow.
| Tool | Strength | Watch out for |
|---|---|---|
| AWS Comprehend Medical (DetectPHI) | HIPAA-eligible, fully managed, decent recall on common PHI | Per-character pricing adds up; English-only; tune for your note style |
| Microsoft Presidio | Open source, extensible recognizers, runs in your VPC | Default models are general-purpose — clinical recall needs custom recognizers |
| Google Cloud Healthcare DLP | Strong template library, structured + unstructured | Requires GCP BAA; cross-cloud egress costs and audit complexity |
| Philter / NLM Scrubber | Designed specifically for clinical text | Lower throughput; integration work for production scale |
| Custom fine-tuned NER (spaCy / BERT) | Highest recall on your specific note types | Training data itself is PHI — do the work inside your PHI zone |
No single detector hits 100% recall on free text. The realistic target is layered defense: a managed service for breadth, a custom model for your local idioms, and a verification gate to catch what the first pass missed.
Stripping identifiers from inputs is necessary but not sufficient. LLM workflows introduce four risks that traditional de-identification was never designed to address.
Quasi-identifier risk: a "72-year-old left-handed concert pianist with bilateral hip replacements" is identifiable in a small enough population, even with no name attached. Expert Determination addresses this; Safe Harbor alone does not.
Memorization: if you fine-tune on de-identified data, the model can still memorize rare phrases verbatim. Use differential privacy during training (DP-SGD) or restrict fine-tuning to data already cleared by Expert Determination.
Prompt-time re-injection: a clinician's prompt — "What did Mrs. Garcia's last visit say about her A1C?" — re-injects PHI even when the underlying record is de-identified. Wrap user prompts with the same redaction layer you apply at ingest, or keep that path inside the PHI zone with a BAA-covered model.
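The prompt path can reuse the ingest-time redactor; the wrapper below only enforces that ordering. Both callables are placeholders for your actual redaction stack and model client:

```python
from typing import Callable

Redactor = Callable[[str], str]   # the same redaction stack used at ingest
LLMClient = Callable[[str], str]  # placeholder for a BAA-gated model call

def guarded_completion(user_prompt: str,
                       redact: Redactor,
                       call_llm: LLMClient) -> str:
    """Redact user prompts before they cross the trust boundary.

    A clinician's free-text question is an ingest path like any other:
    it must pass through the redaction layer before reaching any
    endpoint outside the PHI zone.
    """
    return call_llm(redact(user_prompt))
```

The value of making this a dedicated wrapper is auditability: there is exactly one code path by which user text reaches an external model, and it cannot be reached without redaction.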
Observability leakage: LLM observability tools love to capture full prompt/response pairs. If those tools aren't under a BAA, you've just exfiltrated PHI to a third party. Audit every logger, tracer, and eval harness in the path before it touches production data.
De-identification is the gate that decides which parts of a healthcare LLM stack live inside HIPAA's shared-responsibility model and which can move freely. Treat it as architecture, not a utility function: distinct zones, layered detectors, verified outputs, and a documented standard.
The teams that get this right don't just avoid breaches — they unlock the rest of the AI roadmap. Once de-identification is reliable, model selection, vendor flexibility, and training-data reuse all become engineering decisions instead of compliance fights.
Our healthcare AI engineers design de-identification pipelines, architect PHI/de-identified zones on AWS, and stand up evaluation harnesses that keep PHI out of third-party tooling.