Anonymization with Diffix: how it works

Aircloak Insights uses Diffix to retain data quality while achieving a strong level of anonymity.


Generating high-quality statistics through direct database queries without revealing sensitive personal information has long been an elusive goal.

Diffix is a new framework for database querying that anonymizes query results by adding noise tailored to both the query and the underlying dataset. Unlike other anonymization solutions, it allows you to work simply and safely with rich datasets, regardless of your use case.

Diffix was developed in partnership with the Max Planck Institute for Software Systems after years of research.
It has been confirmed to be fully compliant with European guidelines for anonymization.

“Diffix: High-Utility Database Anonymization” – Research Paper

How Diffix Differs
Compared to Other Approaches to Anonymization

Fundamentally, there are two main approaches to data anonymization. One is static anonymization, where a dataset is anonymized in its entirety before it is used for analytical purposes. The other is dynamic anonymization (or query-by-query anonymization), where the anonymization is applied to each query individually. Aircloak takes the dynamic, query-by-query approach.

Aircloak Insights vs. Static Anonymization

When statically anonymizing a dataset, you have to strike the right balance between utility (less anonymization) and protection (stronger anonymization). Getting this balance right is non-trivial, and getting it wrong is the source of the vast majority of known re-identification breaches.

Statically anonymizing a dataset requires you to determine which columns contain sensitive data ahead of time. Once the sensitive data has been identified, it either needs to be removed or altered, leading to a dataset of reduced quality. This process is complicated by the need to take into account any additional knowledge an attacker might have that could lead to the dataset becoming re-identifiable. The lower the level of trust, the lower the granularity of the anonymized dataset has to be.
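As an illustration (with made-up column names and generalization rules, not taken from any real deployment), a static pipeline might suppress and coarsen columns like this:

```python
# Hypothetical sketch of static anonymization: sensitive columns must be
# identified up front and then removed or coarsened for the whole dataset.
records = [
    {"name": "Alice", "zip": "66123", "age": 34, "diagnosis": "flu"},
    {"name": "Bob",   "zip": "66125", "age": 37, "diagnosis": "flu"},
]

def anonymize_statically(rows):
    out = []
    for row in rows:
        out.append({
            "zip": row["zip"][:3] + "**",    # generalize: keep only the zip prefix
            "age": (row["age"] // 10) * 10,  # generalize: bucket ages into decades
            "diagnosis": row["diagnosis"],   # retained, assumed non-identifying
        })                                   # "name" is suppressed entirely
    return out

print(anonymize_statically(records))
```

Note that every analyst, whatever their query, receives this same coarsened dataset; the generalization chosen up front caps the achievable utility.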

Aircloak’s approach to anonymization is dynamic.
Unlike a static approach, Aircloak gives an analyst access to all the underlying data, and dynamically tailors the anonymization to the specific query and the data requested. Diffix understands what data is sensitive under which circumstances, freeing the analyst from error-prone manual configuration. The resulting answer set is fully anonymized and can be freely shared, without worrying about what additional knowledge an analyst might have.

As the data is never anonymized as a whole, the amount of distortion Aircloak Insights applies is minimal.



Dynamic Anonymization: Diffix vs. Differential Privacy

With dynamic anonymization, the anonymization is performed on the results of each query rather than ahead of time on the whole dataset. This is the approach taken by Aircloak, as well as by other well-known techniques such as Differential Privacy. The benefit over static anonymization is that only the dimensions that are part of the query need to be taken into account, which enables fine-grained anonymization without the risk of data exposure.

Dynamic anonymization works by marginally altering the values produced by the query: either the dimensions or the aggregates can be altered. Altering the aggregates is usually done by adding small amounts of statistical noise (for example, Gaussian noise). When the noise is truly random, however, each repetition of a query reduces the level of protection, because averaging the answers cancels the noise out. This is the origin of the privacy (or query) budget in approaches such as Differential Privacy: each query consumes a portion of the available budget, and once the budget is depleted the dataset can no longer be used for analysis.
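The averaging problem, and the budget it forces, can be seen in a few lines of Python (the counts, noise scale, and budget costs below are illustrative numbers, not parameters of any real system):

```python
import random
import statistics

TRUE_COUNT = 120  # the exact aggregate the database would return

def noisy_count(sigma=2.0):
    """Fresh Gaussian noise on every call, as in a naive dynamic scheme."""
    return TRUE_COUNT + random.gauss(0, sigma)

# Repeating the same query and averaging the answers makes the fresh
# noise cancel out, so each repetition leaks a little more information.
answers = [noisy_count() for _ in range(10_000)]
print(statistics.mean(answers))  # converges toward TRUE_COUNT

# Differential Privacy therefore meters access with a privacy budget:
budget = 1.0
COST_PER_QUERY = 0.1
queries_allowed = int(budget / COST_PER_QUERY)  # dataset "expires" after this many
print(queries_allowed)
```

With 10,000 repetitions, the standard error of the mean shrinks to about sigma/100, so the averaged answer pins down the true count almost exactly; the budget exists precisely to stop this.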

Aircloak’s Diffix approach engineers away the need for a privacy budget by producing tailored pseudo-random noise values that do not average away. Repeated or semantically equivalent queries produce the same noise values, so you can ask as many queries of your dataset as you desire.

With Diffix your dataset does not expire!
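The key idea of repeatable, query-seeded noise can be sketched as follows. This is a simplified illustration, not the actual Diffix algorithm (which derives layered noise from the query's conditions and the underlying data):

```python
import hashlib
import struct

def query_seeded_noise(query_fingerprint: str, sigma: float = 2.0) -> float:
    """Pseudo-random noise derived deterministically from the query itself.
    Illustrative only; real Diffix seeds its noise layers differently."""
    digest = hashlib.sha256(query_fingerprint.encode()).digest()
    # Map the first 8 bytes of the hash to a uniform value in [0, 1) ...
    u = struct.unpack(">Q", digest[:8])[0] / 2**64
    # ... then scale it into a fixed, bounded perturbation.
    return (u - 0.5) * 2 * sigma

q = "SELECT count(*) FROM patients WHERE age >= 30"
# The same query always receives the same noise, so repeating it and
# averaging the answers reveals nothing new.
assert query_seeded_noise(q) == query_seeded_noise(q)
```

Because the noise is a fixed function of the query, an attacker gains no fresh samples to average, and no budget needs to be consumed per query.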

Responding to: “When the signal is in the noise: Exploiting Aircloak’s Diffix”

In April 2018, researchers from Imperial College London and CU Louvain published a vulnerability in Diffix. We examined the attack and found the vulnerability to be minor. We published two statements immediately after the release of the academic paper:

Report On The Diffix Vulnerability Announced By Imperial College London And CU Louvain
Statement Regarding The Attack On Diffix By Imperial College Scientists

To further ensure that Aircloak Insights and Diffix meet the highest standards of data privacy and anonymization, we support a variety of projects and initiatives. Read more about them on our Security at Aircloak page.


Ready to see what Aircloak can do for you?