What “privacy” actually means for data use in AI training

A primer for teams training models with privacy-regulated data.

Real-world data is where the next ceiling lifts. The models that matter over the next two years will train on claims, EHR, behavioral, and longitudinal data: the regulated, deeply human data that public web text and synthetic generation don’t contain. The signal that’s still on the table sits in regulated sources, and most of it sits behind a methodology layer that AI teams either treat as paperwork to clear or don’t think about until something downstream forces them to.

That methodology layer is privacy engineering. Specifically, it’s the question of what “de-identified” means for the data you seek to train on, because “de-identified” isn’t one thing. It’s shorthand for at least four distinct methodologies, with different artifacts, different defensibility profiles, and different consequences for what data you can use, what signal survives, and what the dataset is allowed to do downstream.

The teams getting furthest with real-world data right now aren’t the ones with the most aggressive privacy posture, and they’re not the ones treating privacy as a compliance checkbox. They’re the ones treating methodology as infrastructure, who know which method produced their dataset, what it preserved, what it stripped, and what it can answer when an enterprise procurement team, a government contractor, or a regulator eventually asks.

This is a primer for that.

The four things “de-identified” can actually mean

Safe Harbor — HIPAA §164.514(b)(2)

The default. Eighteen specific categories of identifiers get removed. Names, geographic subdivisions smaller than a state (the first three ZIP digits can stay only where that three-digit area holds more than 20,000 people), dates more specific than the year, ages over 89, and so on. It’s a checklist. It’s the most common interpretation you’ll see in vendor documentation, and it’s the one most teams assume when they hear “de-identified.”
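To make the checklist character concrete, here is a minimal Python sketch of three of the generalizations: ZIP truncation, date reduction, and age top-coding. The field names (`zip`, `service_date`, `age`) and the set of sparse ZIP prefixes are assumptions made for illustration. A real pipeline would derive the sparse-prefix list from current Census data and would also have to handle the remaining identifier categories.

```python
from datetime import date

# Hypothetical placeholder list of 3-digit ZIP prefixes whose combined area
# holds 20,000 or fewer people. A real list is derived from Census data.
SMALL_POPULATION_ZIP3 = {"036", "059", "102"}


def generalize_record(record: dict) -> dict:
    """Apply three Safe Harbor-style generalizations to one record.

    Only ZIP codes, dates, and ages are covered. The other identifier
    categories (names, contact details, record numbers, and so on) would
    be dropped outright and are not modeled here.
    """
    out = dict(record)

    # Keep only the first three ZIP digits; use "000" for sparse areas.
    zip3 = str(record["zip"])[:3]
    out["zip"] = "000" if zip3 in SMALL_POPULATION_ZIP3 else zip3

    # Reduce dates to the year.
    out["service_date"] = record["service_date"].year

    # Aggregate ages over 89 into a single top category.
    out["age"] = "90+" if record["age"] > 89 else record["age"]

    return out


example = {"zip": "94110", "service_date": date(2023, 3, 14), "age": 93}
generalized = generalize_record(example)
print(generalized)
```

Every step is a fixed rule applied to a field. Nothing in it looks at the dataset as a whole, the recipient, or the intended use.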

Safe Harbor was designed for traditional data releases. Point-in-time, controlled recipient, limited composition. In that setting, it does the job.

The AI training context is different in two ways that matter.

The first is what Safe Harbor strips. The eighteen categories include many of the fields models need to learn from. Date granularity. Geographic detail. And with them, the longitudinal markers and cohort detail built on those fields. The signal you bought the data for. Safe Harbor preserves the structure of the data while removing the parts that make the structure informative. The data still exists. It’s just less useful for what you’re trying to train.

The second is what Safe Harbor doesn’t address. The eighteen categories cover known direct and indirect identifiers. They don’t account for quasi-identifier combinations that become identifying at population scale. They don’t address composition risk when sources are joined. They don’t model membership inference patterns that emerge when models are exposed to adversarial queries. Safe Harbor wasn’t scoped for any of those, and they’re exactly the failure modes that surface when downstream buyers, regulators, or plaintiffs’ counsel start asking about defensibility.
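The quasi-identifier point can be illustrated with a simple group-size count, the k in k-anonymity. This is a minimal sketch on made-up rows, with hypothetical column names. It is not a full risk assessment, which would also consider what outside data an adversary could bring. Every field below would survive Safe Harbor as written.

```python
from collections import Counter


def smallest_group_size(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share the same
    values on the given quasi-identifier columns."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


# Made-up rows: 3-digit ZIP, birth year, sex, and a diagnosis code.
rows = [
    {"zip3": "941", "birth_year": 1980, "sex": "F", "dx": "E11"},
    {"zip3": "941", "birth_year": 1980, "sex": "F", "dx": "I10"},
    {"zip3": "941", "birth_year": 1980, "sex": "M", "dx": "E11"},
    {"zip3": "941", "birth_year": 1975, "sex": "M", "dx": "J45"},
    {"zip3": "941", "birth_year": 1975, "sex": "M", "dx": "E11"},
]

k_one_field = smallest_group_size(rows, ["zip3"])
k_combined = smallest_group_size(rows, ["zip3", "birth_year", "sex"])
print(k_one_field, k_combined)
```

On ZIP3 alone, every row hides in a group of five. On the combination of three permitted fields, one row stands alone.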

So a Safe Harbor pipeline can leave you with data that’s both less useful for training and less defensible than the documentation suggests. Worth knowing before building on top of it.

Expert Determination — HIPAA §164.514(b)(1)

The other method HIPAA defines. A qualified statistical expert assesses the specific dataset, in its specific context, for its specific recipient class, documents the methods and results, and produces a signed determination that the risk of re-identification is “very small.”

Expert Determination is the method that keeps the signal. It doesn’t strip categories of fields wholesale. It assesses risk based on the actual composition of the data and the actual conditions of use, then calibrates privacy techniques against the residual risk that matters. The longitudinal join can stay. The rare cohort can stay. The geographic granularity the model needs can stay, scoped to who’s receiving the data and how it’s being used.

The artifact is a signed opinion, tied to dataset, model context, and recipient class. It answers the question downstream buyers actually ask: who independently assessed this data, against what risk surface, for what use.

It costs more than running a Safe Harbor script. It requires methodology, not a checklist. Whether you need it depends on what you’re training, who’s eventually going to ask about the data, and what you need the dataset to be able to do. For internal R&D in a closed loop, it’s overkill. For data that will reach enterprise buyers or government contractors, or be commercialized externally, it’s usually the artifact those reviewers want.

Contractual de-identification

A lot of what AI teams buy today falls in this category. Data processed by a vendor’s internal tooling, usually some combination of PII detection and rule-based redaction, and attested as “de-identified” under a data use agreement (DUA) or licensing agreement. No statutory standard. The vendor’s process is the standard, and the contract is the documentation.

Not inherently weaker than the other methods. Some vendors run real privacy engineering. But the artifact is different. The vendor attests. Whether an independent qualified expert ever assessed the data for your specific use is a separate question, and the answer is usually no.

For internal R&D with controlled distribution, this is often enough. For data that will be sold, licensed downstream, or fed into models that get deployed externally, the gap between “the vendor attested” and “an independent expert assessed for our recipient class” can show up later as a procurement problem.

Pseudonymization (including tokenization)

Direct identifiers replaced with stable tokens that let the same record be linked across sources by whoever holds the key. Tokenization, the technique behind most modern record-linkage products, is a pseudonymization method.

Pseudonymization isn’t de-identification under HIPAA. It’s a different operation with a different purpose: enable linkage. It’s what lets you join claims to EHR to behavioral, build longitudinal records, do the multi-source assembly that makes regulated data valuable for AI in the first place.

The key exists, so identifiability risk carries through. The de-identification of the linked dataset is a separate question, requiring a separate methodology, producing a separate artifact. The common point of confusion in pipelines built on tokenized sources is treating the tokenization vendor’s documentation as if it covers both jobs. It doesn’t.
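One common way to build stable tokens is a keyed hash. The sketch below uses HMAC-SHA256 from the Python standard library. It is an illustration of the general idea, not a description of any particular vendor’s product, and the key, the normalization rule, and the identifier format are all assumptions.

```python
import hashlib
import hmac


def tokenize(identifier: str, key: bytes) -> str:
    """Derive a stable token from a direct identifier with a secret key.

    The same identifier and key always give the same token, which is what
    makes cross-source linkage possible.
    """
    normalized = identifier.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key for illustration; a real key lives in a key manager.
KEY = b"demo-key-not-for-production"

claims_token = tokenize("Jane Doe|1980-02-01", KEY)
ehr_token = tokenize(" jane doe|1980-02-01 ", KEY)
other_key_token = tokenize("Jane Doe|1980-02-01", b"a-different-key")

print(claims_token == ehr_token)        # same person links across sources
print(claims_token == other_key_token)  # a different key gives a different token
```

Whoever holds the key can recompute the token for any known person and look them up. The token hides the name from everyone else, but it does nothing about the other fields in the record.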

Why the language matters more in AI than in traditional data work

In a traditional data release, getting the methodology wrong at acquisition was usually a fixable problem. Legal review caught it. The dataset went back or got reprocessed. The cost was time.

In an AI pipeline, the methodology applied at acquisition becomes the evidentiary record. The data gets composed across sources, trained on at population scale, and once it’s in the model, it’s not coming out. Whatever language you used at the start becomes the answer to whatever question gets asked later.

The questions vary by buyer. An internal research team won’t ask much. An enterprise procurement team will ask who assessed the data, against what risk surface, for what recipient class. A government contractor will ask the same in more detail. A regulator or plaintiffs’ counsel will ask with documentation requirements attached.

A buyer asks: “Is this dataset de-identified under HIPAA?”

The Safe Harbor answer is: “Yes, here’s the methodology, applied verbatim.”
The Expert Determination answer is: “Yes, here’s the signed determination, scoped to your use case and recipient class.”
The contractual answer is: “The vendor attests to it under the agreement, here’s the documentation.”
The pseudonymized answer is: “Not directly, the data is keyed, and the assessment of the linked dataset is a separate question.”

Four different answers to the same question. Each is accurate under its own definition of “de-identified.” The problem is that AI teams often don’t know which one applies to the data they just acquired and which one their downstream buyer is going to ask for.

What this looks like in practice

A few practical implications.

If you’re sourcing data and the documentation says “de-identified,” it’s worth asking which of the four methods produced that label. The answer determines what data you actually have, what signal it preserves, and what artifact you can point to downstream.

If your use case stays internal, you have room. Safe Harbor or contractual de-identification can be sufficient, provided you understand what each one is and isn’t doing for your specific data and model context.

If your use case eventually reaches an enterprise buyer or a government contractor, or moves into external commercialization, the artifact those reviewers are asking for is closer to Expert Determination. A vendor attestation and a Safe Harbor methodology document don’t answer the question they’re asking.

If your data is the product of a join across pseudonymized sources, the de-identification question has to be re-asked of the joined output. The components aren’t the answer.
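A toy example of why the joined output needs its own assessment. The two sources below are made up, and the tokens and column names are hypothetical. In each source, no record is unique on its own fields. After the join on the shared token, every record is.

```python
from collections import Counter


def count_unique_records(rows, fields):
    """Count rows whose combination of values on `fields` appears only once."""
    groups = Counter(tuple(row[f] for f in fields) for row in rows)
    return sum(1 for row in rows if groups[tuple(row[f] for f in fields)] == 1)


claims = [
    {"token": "t1", "zip3": "941", "birth_year": 1980},
    {"token": "t2", "zip3": "941", "birth_year": 1980},
    {"token": "t3", "zip3": "941", "birth_year": 1975},
    {"token": "t4", "zip3": "941", "birth_year": 1975},
]
behavioral = [
    {"token": "t1", "device_os": "ios"},
    {"token": "t2", "device_os": "android"},
    {"token": "t3", "device_os": "ios"},
    {"token": "t4", "device_os": "android"},
]

# Join the two sources on the shared pseudonymous token.
behavioral_by_token = {row["token"]: row for row in behavioral}
joined = [{**c, **behavioral_by_token[c["token"]]} for c in claims]

unique_in_claims = count_unique_records(claims, ["zip3", "birth_year"])
unique_in_behavioral = count_unique_records(behavioral, ["device_os"])
unique_in_joined = count_unique_records(joined, ["zip3", "birth_year", "device_os"])
print(unique_in_claims, unique_in_behavioral, unique_in_joined)
```

Neither source’s documentation could have predicted the result, because the uniqueness only exists in the combination.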

If your pipeline refreshes, with new data flowing in, new sources being composed, and new model contexts, the methodology has to refresh with it. A signed determination is dated the day it’s issued. Continuous pipelines need continuous oversight.
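One way to operationalize this is a scope check that runs whenever the pipeline changes. The sketch below is hypothetical throughout: the `Determination` record, its fields, and the `valid_until` date are assumptions about how a team might store a determination’s scope, not a standard format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Determination:
    """Hypothetical record of what a signed determination covered."""
    issued_on: date
    valid_until: date          # assumed review date set by the expert
    sources: frozenset         # data sources that were assessed
    recipient_class: str       # who the data was assessed for


def coverage_gaps(det, current_sources, recipient_class, as_of):
    """Return the reasons the current pipeline falls outside the
    determination's scope. An empty list means no gap was found."""
    gaps = []
    if as_of > det.valid_until:
        gaps.append("determination is past its review date")
    unassessed = set(current_sources) - det.sources
    if unassessed:
        gaps.append(f"sources not assessed: {sorted(unassessed)}")
    if recipient_class != det.recipient_class:
        gaps.append(f"recipient class not assessed: {recipient_class}")
    return gaps


det = Determination(
    issued_on=date(2024, 1, 15),
    valid_until=date(2025, 1, 15),
    sources=frozenset({"claims", "ehr"}),
    recipient_class="internal-research",
)

in_scope = coverage_gaps(det, {"claims", "ehr"}, "internal-research", date(2024, 6, 1))
drifted = coverage_gaps(
    det, {"claims", "ehr", "behavioral"}, "enterprise-buyer", date(2025, 3, 1)
)
print(in_scope)
print(drifted)
```

A check like this only flags drift. Closing the gap still means going back to the expert.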

The point

Real-world data is the unlock. Methodology is the infrastructure that makes it usable. “De-identified” is a category of decisions that determines what data you can train on, what signal survives, and what your dataset can defend downstream.

The teams moving fastest with regulated data aren’t the ones with the most aggressive privacy posture, and they’re not the ones treating privacy as paperwork to clear. They’re the ones who know which method produced their data, why, and what artifact they’re holding when someone asks.

That’s the infrastructure. Everything else builds on top of it.

Integral Privacy Technologies is the independent privacy layer for AI. We embed in regulated data pipelines, scope and sign Expert Determinations under peer-reviewed methodology, and produce documentation that holds up under enterprise and regulatory examination. useintegral.com/platform/forward-deployed.