Anonymisation with Diffix: how it works
Aircloak Insights uses Diffix to retain data quality while achieving a strong level of anonymity.
Generating high-quality statistics through direct database queries, without revealing sensitive personal information from the dataset, has long been an elusive dream.
Diffix is a new framework for database querying that anonymises query results by adding noise tailored to both the query and the underlying dataset. Unlike other anonymisation solutions, it allows you to work simply and safely with rich datasets, regardless of your use case.
Diffix was developed in partnership with the Max Planck Institute for Software Systems after years of research.
It has been confirmed to be fully compliant with European guidelines for anonymisation.
How Diffix Differs
Compared to other approaches to anonymisation
There are two main approaches to data anonymisation. One is static anonymisation, where a dataset is anonymised in its entirety before it is used for analytical purposes. The other is dynamic anonymisation (or query-by-query anonymisation), where the anonymisation is applied to each query individually. Aircloak’s approach to anonymisation is a dynamic, query-by-query approach.
Aircloak Insights vs Static anonymisation
When statically anonymising a dataset, you have to find the right balance between utility (which means less anonymisation) and security and protection (stronger anonymisation). Getting this balance right is non-trivial and is the source of the vast majority of known data re-identification breaches.
Statically anonymising a dataset requires you to determine which columns contain sensitive data ahead of time. Once the sensitive data has been identified, it either needs to be removed or altered, leading to a dataset of reduced quality. This process is complicated by the need to take into account any additional knowledge an attacker might have that could lead to the dataset becoming re-identifiable. The lower the level of trust, the lower the granularity of the anonymised dataset has to be.
Aircloak’s approach to anonymisation is dynamic.
Unlike a static approach, Aircloak gives an analyst access to all the underlying data, and dynamically tailors the anonymisation to the specific query and the data requested. Diffix understands what data is sensitive under which circumstances, freeing the analyst from error-prone manual configuration. The resulting answer set is fully anonymised and can be freely shared, without worrying about what additional knowledge an analyst might have.
As the data is never anonymised as a whole, the amount of distortion Aircloak Insights applies is minimal.
Dynamic anonymisation: Diffix vs Differential Privacy
With dynamic anonymisation, the anonymisation is applied to the results of each query rather than ahead of time to the whole dataset. This is the approach taken by Aircloak, as well as by other well-known approaches such as Differential Privacy. The benefit over static anonymisation is that only the dimensions that are part of the query need to be taken into account. This makes it possible to perform fine-grained anonymisation without the risk of data exposure.
Dynamic anonymisation works by marginally altering the values a query produces. Either the dimensions or the aggregates can be altered. Aggregates are usually altered by adding a small amount of statistical noise (for example Gaussian noise). When the noise is truly random, every additional query reduces the level of protection, because an attacker can repeat a query and average the answers until the noise cancels out. This is why approaches such as Differential Privacy need a privacy budget (or query budget): each query consumes part of the available budget, and once the budget is depleted the dataset can no longer be used for analysis.
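The effect of fresh random noise, and the budget that guards against it, can be illustrated with a minimal Python sketch. It is not taken from any Differential Privacy library: the names noisy_count, averaged_estimate and PrivacyBudget, the noise level sd=2.0 and the per-query cost are all assumptions chosen for illustration.

```python
import random
import statistics


def noisy_count(true_count, rng, sd=2.0):
    """Return the true count plus fresh zero-mean Gaussian noise."""
    return true_count + rng.gauss(0.0, sd)


def averaged_estimate(true_count, repeats, seed=0, sd=2.0):
    """Simulate an attacker who repeats one query and averages the answers."""
    rng = random.Random(seed)  # seeded only so the simulation is reproducible
    answers = [noisy_count(true_count, rng, sd) for _ in range(repeats)]
    return statistics.fmean(answers)


class PrivacyBudget:
    """Hypothetical budget tracker: each query spends part of a fixed total."""

    def __init__(self, total):
        self.remaining = total

    def spend(self, cost):
        if cost > self.remaining:
            raise RuntimeError("privacy budget depleted")
        self.remaining -= cost
```

With a noise standard deviation of 2.0, the average of 10,000 repeated answers has a standard error of roughly 0.02, so the attacker recovers the true count almost exactly. A budget such as PrivacyBudget caps the number of answers an attacker can collect, at the cost of eventually refusing all further queries.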
Aircloak’s Diffix approach has engineered away the need for a privacy budget by producing tailored pseudo-random noise values that do not average away. Repeated or semantically equivalent queries produce the same noise values, so you can run as many queries against your dataset as you like.
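The idea of noise that repeats for a repeated query can be sketched in a few lines of Python. This is an illustration of deterministic, query-dependent noise in general, not Aircloak's actual implementation: the secret salt, the function names and the crude text normalisation are all assumptions, and the normalisation only catches differences in case and whitespace rather than true semantic equivalence.

```python
import hashlib
import random

# Hypothetical secret known only to the anonymising system.
SECRET_SALT = b"hypothetical-secret-salt"


def normalise(query):
    """Crude stand-in for query normalisation: lowercase, collapse whitespace."""
    return " ".join(query.lower().split())


def sticky_noise(query, sd=2.0):
    """Derive a Gaussian noise value deterministically from the query text."""
    digest = hashlib.sha256(SECRET_SALT + normalise(query).encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest, "big"))  # seed depends on the query
    return rng.gauss(0.0, sd)


def sticky_count(true_count, query, sd=2.0):
    """Return the true count plus noise that is fixed for a given query."""
    return true_count + sticky_noise(query, sd)
```

Because the seed is derived from the query, asking the same question a thousand times returns the same answer a thousand times, and averaging reveals nothing new. A different query gets a different noise value.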
With Diffix your dataset does not expire!