Anonymisation with Diffix: how it works
Using Diffix, Aircloak Insights retains data quality while achieving a strong level of anonymity.
Getting high-quality statistics through direct database queries, without revealing information specific to the individuals in the dataset, has long been an elusive goal.
Diffix is a new framework for database querying that anonymises query results by adding noise tailored to the query as well as the underlying dataset. Unlike other anonymisation solutions, it allows you to work simply and safely with rich datasets regardless of your use case.
Diffix was developed in partnership with the Max Planck Institute for Software Systems after years of research.
It has been confirmed to be fully compliant with European guidelines for anonymisation.
How Diffix Differs
Compared to other approaches to anonymisation
Fundamentally there are two main approaches to data anonymisation. One is static anonymisation, where a dataset is anonymised in its entirety before it is used for analytical purposes. The other is dynamic anonymisation (or query-by-query anonymisation), where the anonymisation is applied to each query individually. Aircloak’s approach to anonymisation is a dynamic query-by-query approach.
Aircloak Insights vs Static anonymisation
When statically anonymising a dataset you have to find the right balance between utility (less anonymisation) and security and protection (stronger anonymisation). Getting this balance right is non-trivial and the source of the vast majority of known data re-identification breaches.
Statically anonymising a dataset requires you to determine which columns contain sensitive data ahead of time. Once the sensitive data has been identified it either needs to be removed or altered, leading to a dataset of reduced quality. This process is complicated by the need to take into account any additional knowledge an attacker might have that could lead to the dataset becoming re-identifiable. The lower the level of trust, the lower the granularity in the anonymised dataset has to be.
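As a rough illustration of the static process described above, the following sketch removes one column and coarsens two others before release. The records, column names, and generalisation rules are hypothetical and chosen only to show how granularity is lost; they are not taken from any particular tool.

```python
# Minimal sketch of static anonymisation by removal and generalisation.
# All column names and rules here are illustrative assumptions.

def generalise(record):
    """Return a release-ready copy of a record with reduced granularity."""
    decade = (record["age"] // 10) * 10
    return {
        # "name" is dropped entirely (removal of a direct identifier).
        "age_band": f"{decade}-{decade + 9}",      # exact age -> 10-year band
        "zip_prefix": record["zip"][:3] + "**",    # 5-digit code -> 3-digit prefix
        "diagnosis": record["diagnosis"],          # kept for analysis
    }

records = [
    {"name": "Alice", "age": 34, "zip": "67655", "diagnosis": "flu"},
    {"name": "Bob", "age": 37, "zip": "67663", "diagnosis": "asthma"},
]

anonymised = [generalise(r) for r in records]
for row in anonymised:
    print(row)
```

After this step both people fall into the same age band and postcode prefix, so any analysis that needed exact ages or full postcodes is no longer possible on the released data.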
Aircloak’s approach to anonymisation is dynamic.
Unlike a static approach, Aircloak gives an analyst access to all the underlying data and subsequently tailors the anonymisation to the specific query and the data requested. Diffix understands what data is sensitive under which circumstances, freeing the analyst from error-prone manual configuration. The resulting answer set is fully anonymised and can be freely shared, without consideration of what additional knowledge an analyst might have.
As the data is never anonymised as a whole, the amount of distortion Aircloak Insights needs to apply can be kept minimal.
Dynamic anonymisation: Diffix vs Differential Privacy
With dynamic anonymisation, the anonymisation is applied to the results of a query at query time rather than to the dataset as a whole ahead of time. This is the approach taken by Aircloak, as well as by other well-known approaches such as Differential Privacy. The benefit over static anonymisation is that only the dimensions that are part of the query need to be taken into account. This allows fine-grained anonymisation without the risk of data exposure.
Dynamic anonymisation slightly alters the values produced. Either the dimensions or the aggregates can be altered. The common approach is to alter the aggregates themselves by adding small amounts of statistical noise (for example Gaussian noise). When the noise is truly random, each subsequent query reduces the level of protection, because repeated answers can be averaged to cancel the noise out. This is the reason for the concept of a privacy or query budget in approaches such as Differential Privacy. Each query consumes part of the available budget until it is depleted. Once the budget is empty, the dataset can no longer be used for analysis.
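The effect of truly random noise under repeated queries can be sketched in a few lines. The true count, the noise standard deviation, and the function name `noisy_count` below are illustrative assumptions, not parameters of any real system.

```python
import random
import statistics

TRUE_COUNT = 120   # hypothetical true answer to a counting query
NOISE_SD = 2.0     # hypothetical standard deviation of the added noise

def noisy_count(rng):
    """Answer the query with fresh, independent Gaussian noise each time."""
    return TRUE_COUNT + rng.gauss(0.0, NOISE_SD)

rng = random.Random(42)  # fixed seed only to make the sketch repeatable

single_answer = noisy_count(rng)
repeated_answers = [noisy_count(rng) for _ in range(10_000)]
averaged_answer = statistics.mean(repeated_answers)

print(f"one answer:        {single_answer:.2f}")
print(f"average of 10,000: {averaged_answer:.2f}")
```

A single answer can be off by a few units, but the average of many answers lands very close to the true count. Limiting how often the data may be queried is one way to stop this, which is what a budget does.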
Aircloak’s Diffix approach has engineered away the need for a privacy budget by producing tailored random noise values that do not average away. Repeated or semantically equivalent queries produce the same noise values, so you can run as many queries against your dataset as you like.
With Diffix your dataset does not expire!
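The idea of noise that repeats for equivalent queries can be illustrated with a deliberately simplified sketch. This is not Diffix's actual algorithm: here the noise is seeded only from a hash of a crudely normalised query string plus a secret salt, whereas Diffix tailors its noise to the query and the underlying data. The names `SECRET_SALT`, `normalise`, and `sticky_noise` are hypothetical.

```python
import hashlib
import random

SECRET_SALT = b"hypothetical-secret"  # assumed to be unknown to the analyst

def normalise(query):
    """Crude stand-in for semantic equivalence: ignore case and spacing."""
    return " ".join(query.lower().split())

def sticky_noise(query, sd=2.0):
    """Derive a noise value that is fixed for a given (normalised) query."""
    digest = hashlib.sha256(SECRET_SALT + normalise(query).encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    return random.Random(seed).gauss(0.0, sd)

def noisy_count(query, true_count):
    return true_count + sticky_noise(query)

first = noisy_count("SELECT count(*) FROM visits", 120)
again = noisy_count("select  COUNT(*)  from visits", 120)
print(first == again)  # the same question always gets the same answer
```

Because repeating the query returns the identical value, averaging many answers reveals nothing beyond the first one, which is why no query budget is needed in this toy model.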