Anonymization with Aircloak: how it works

Aircloak Insights uses a unique anonymization method to retain data quality while achieving a strong level of anonymity.

Generating high-quality statistics through direct database queries without revealing sensitive personal information has long been an elusive dream.

Our new framework for database querying anonymizes query results by adding noise tailored both to the query and to the underlying dataset. Unlike other anonymization solutions, it allows you to analyze rich datasets simply and safely, regardless of your use case.

Aircloak’s anonymization is based on a combination of time-tested ideas such as k-anonymity, low-count suppression, top and bottom coding, and differential-privacy noise, as well as patented open concepts developed jointly by Aircloak and the Max Planck Institute for Software Systems (MPI-SWS), including Sticky Layered Noise and safe SQL filtering.
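
To make two of these building blocks concrete, here is a minimal sketch of low-count suppression and top and bottom coding. The threshold, field names, and helper functions are illustrative assumptions, not Aircloak's actual parameters or code.

```python
# Illustrative sketch only; the threshold and field names are assumptions.
SUPPRESS_THRESHOLD = 5  # buckets backed by fewer distinct users are dropped

def low_count_suppress(buckets):
    """Drop result buckets that too few distinct users contribute to."""
    return {key: rows for key, rows in buckets.items()
            if len({row["user_id"] for row in rows}) >= SUPPRESS_THRESHOLD}

def top_bottom_code(values, lower, upper):
    """Clamp extreme values so no single outlier can dominate an aggregate."""
    return [min(max(v, lower), upper) for v in values]

print(top_bottom_code([3, 9_000_000, 12], lower=0, upper=100))  # [3, 100, 12]
```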

Our approach was developed over years of research in partnership with the Max Planck Institute for Software Systems, and it has been confirmed to be fully compliant with European guidelines on anonymization.


“Diffix-Birch: Extending Diffix-Aspen” – Research Paper

How Aircloak Differs from Other Privacy Approaches

There are several approaches that enable you to safely work with and analyze sensitive data. Choosing the right one for your project is key, since every use case has different requirements.

In our blog article A Visual Comparison for Privacy Approaches in Data Analytics, we concisely explain the strengths and weaknesses of each approach and compare them with regard to analytical quality, time-to-market, and their suitability for privacy, machine learning, and test data use cases.

For further reading:

The 7 Myths of Data Anonymization
Explaining Differential Privacy in 3 Levels of Difficulty
Differences between Static and Interactive Anonymization
Aircloak Whitepaper – Data Anonymization in Digital Business Models

Aircloak Insights vs Static anonymization

When statically anonymizing a dataset, you have to find the right balance between utility (weaker anonymization) and protection (stronger anonymization). Getting this balance right is non-trivial, and getting it wrong is the source of the vast majority of known data re-identification breaches.

Statically anonymizing a dataset requires you to determine which columns contain sensitive data ahead of time. Once the sensitive data has been identified, it either needs to be removed or altered, leading to a dataset of reduced quality. This process is complicated by the need to take into account any additional knowledge an attacker might have that could lead to the dataset becoming re-identifiable. The lower the level of trust, the lower the granularity of the anonymized dataset has to be.
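
As a hedged illustration of what this ahead-of-time work looks like, the sketch below removes direct identifiers and coarsens quasi-identifiers for an entire (made-up) table; the column names and generalization rules are invented for the example.

```python
import pandas as pd

# Toy dataset; columns and values are invented for illustration.
df = pd.DataFrame({
    "name": ["Ada", "Ben"], "email": ["a@x.de", "b@y.de"],
    "age": [34, 58], "zip": ["67663", "10115"], "purchases": [12, 3],
})

anonymized = df.drop(columns=["name", "email"])        # direct identifiers: removed
anonymized["age"] = (anonymized["age"] // 10) * 10     # quasi-identifier: coarsened to decades
anonymized["zip"] = anonymized["zip"].str[:2] + "xxx"  # quasi-identifier: truncated

print(anonymized)  # every downstream analysis works on this single, degraded copy
```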

Aircloak’s approach to anonymization is dynamic.
Unlike a static approach, Aircloak gives an analyst access to all the underlying data, and dynamically tailors the anonymization to the specific query and the data requested. The system understands what data is sensitive under which circumstances, freeing the analyst from error-prone manual configuration. The resulting answer set is fully anonymized and can be freely shared, without worrying about what additional knowledge an analyst might have.

As the data is never anonymized as a whole, the amount of distortion Aircloak Insights applies is minimal.
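
A minimal sketch of this dynamic flow, under invented parameters (table contents, noise scale, suppression threshold), might look as follows; it is a simplification, not Aircloak's implementation:

```python
import random

# Toy "live" table; in the dynamic model the raw data is never altered.
rows = [{"user_id": i, "city": "Berlin" if i % 3 else "Kaiserslautern"}
        for i in range(100)]

def anonymized_count(rows, predicate, threshold=5, noise_sd=1.0):
    """Answer one query, anonymizing only the answer, not the dataset."""
    users = {r["user_id"] for r in rows if predicate(r)}
    if len(users) < threshold:                             # low-count suppression
        return None
    return round(len(users) + random.gauss(0, noise_sd))   # noise tailored per answer

print(anonymized_count(rows, lambda r: r["city"] == "Berlin"))
```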

Dynamic anonymization: Aircloak Insights vs Differential Privacy

With dynamic anonymization, the anonymization is performed on the results of each query rather than ahead of time on the whole dataset. This is the approach taken by Aircloak, as well as by other well-known techniques such as Differential Privacy. The benefit over static anonymization is that only the dimensions that are part of the query need to be taken into account, which enables fine-grained anonymization without the risk of data exposure.

Dynamic anonymization works by marginally altering the values produced by the query. Either the dimensions or the aggregates can be altered. Altering the aggregates is usually done by adding small amounts of statistical noise (for example, Gaussian noise). When the noise is truly random, however, each repetition of a query erodes the protection: an attacker can simply average the answers, and the noise cancels out. This is the reason for the concept of a privacy (or query) budget in approaches such as Differential Privacy. Each query consumes a bit of the available budget until it is depleted, at which point the dataset can no longer be used for analysis.
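
The averaging effect is easy to demonstrate. In the sketch below (with invented numbers), fresh Gaussian noise is drawn for every repetition of the same query, and the mean of many answers converges on the true value:

```python
import random
import statistics

TRUE_COUNT = 120  # invented "secret" answer

def noisy_answer():
    return TRUE_COUNT + random.gauss(0, 5)  # fresh noise on every repetition

answers = [noisy_answer() for _ in range(1000)]
print(round(statistics.mean(answers)))  # converges on 120 -- hence the budget
```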

Aircloak’s approach engineers away the need for a privacy budget by producing tailored pseudo-random noise values that do not average away. Repeated or semantically equivalent queries produce the same noise values, which means you can ask as many queries of your dataset as you like.
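
A highly simplified sketch of the sticky idea: derive the noise deterministically from the query. Aircloak's actual design also seeds on the affected data and normalizes semantically equivalent queries, which this toy seeding scheme does not.

```python
import hashlib
import random

TRUE_COUNT = 120  # invented "secret" answer

def sticky_answer(sql):
    seed = hashlib.sha256(sql.encode()).hexdigest()  # same query -> same seed
    return TRUE_COUNT + random.Random(seed).gauss(0, 5)

a = sticky_answer("SELECT count(*) FROM users WHERE city = 'Berlin'")
b = sticky_answer("SELECT count(*) FROM users WHERE city = 'Berlin'")
assert a == b  # repetition returns the identical answer; averaging gains nothing
```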

With Aircloak your dataset does not expire!