Part VI
How Aircloak Insights Can Help
Aircloak Insights is the first solution to offer real-time database anonymization that allows analysts to query anonymized data exactly as if it were the original raw data.
In this section we will explain how Aircloak’s technology works and show why it is the first GDPR-compliant tool for database anonymization.
Strong Anonymization for Your Data Analytics
Aircloak Insights is a transparent proxy that sits between analysts and the data they need to work with. Because it is transparent, analysts can query the data as if they were accessing it directly, either by writing SQL or by building dashboards with tools like Tableau. Aircloak Insights intercepts each query and converts it into a suitable form for the backend (which may be a structured SQL database or a NoSQL data lake). As results are returned, Aircloak Insights ensures they are suitably aggregated to guarantee full anonymity.
The huge benefit of Aircloak Insights is that it allows analysts to use the tools they are familiar with, without any need for specific expertise in anonymization.
It is also compatible with the most common databases, both SQL and NoSQL. This means it can be integrated directly into your existing analytics stack without the need for any major modifications. The following data sources are supported:
- Microsoft SQL Server, versions 2012 R2 and newer
- MongoDB, versions 3.4 and newer
- MySQL, version 5 and newer, and MariaDB, version 10.1 and newer
- PostgreSQL, version 9.1 and newer
- SAP HANA, version 2.0 and newer
- SAP IQ, version 16.0 and newer
- Apache Drill, version 1.13 and newer
Please note that with Apache Drill you can query a multitude of data sources. Out of the box it provides support for big data environments such as MapR-DB and HBase, as well as for querying CSV, JSON, log, and Parquet files stored on disk, on HDFS, or in AWS S3.
Works with Your Existing SQL and NoSQL Databases
Aircloak Insights is installed in front of and connects to your existing database servers.
You do not need to make any changes to your databases or their schemas. This has two main benefits. First, the time it takes to go from an idea to a working deployment is minimal. Second, you do not need to create, maintain, and safeguard additional copies of your data.
How Does Aircloak Insights Work?
Instead of anonymizing or pseudonymizing the source data prior to analysis, Aircloak Insights performs the analytics on the unprocessed data. The analytics output is then anonymized automatically as each query is run. This avoids the risk that the anonymization might destroy or reduce the utility of the data, and it gives a strong level of anonymity while retaining the accuracy of the results.
The key to this design is the way that the analytics and anonymization steps are performed simultaneously.
This approach avoids the usual problems faced when anonymizing data. If the data is anonymized before it is analysed, you may well lose much of its utility. But if the data is anonymized after the analysis, you don’t have sufficient knowledge of the underlying data to know whether the anonymization is strong enough.
Aircloak Insights consists of two discrete modules that work together. Both components are deployed as standard Docker containers and can run equally well on bare metal servers or your own virtual server infrastructure.
Insights Air
Insights Air is a web-based control centre that ships as an integral part of Aircloak Insights. Insights Air offers you full granular control over who sees your data and provides the data security and audit functions you need to stay compliant with data protection legislation. Insights Air incorporates LDAP and provides full authentication and authorisation functionality to verify the identity of analysts and ensure they only access the datasets they are allowed to.
Administrators also get full visibility of the system health and a detailed audit log tracking all queries and system actions. One of the best features is the live query tracker, which allows you to see exactly what queries are currently being run, who is running them, and what impact they are having on your system resources.
Insights Cloak
Insights Cloak handles the actual queries, passing them to the data storage and anonymizing the results as they come back. It is able to do the anonymization properly because it understands the actual data. This means it can apply the most suitable anonymization to the results before passing these back to the analyst.
Insights Cloak is designed so that it can be deployed within the secure perimeter around your data sources. This means that no personal data ever passes the secure perimeter. In turn, this means that Insights Cloak doesn’t open up any new attack vectors, reducing the risk compared with other solutions.
The Technology Behind Aircloak Insights
Aircloak Insights uses an innovative approach to ensure that any data that is released has been suitably anonymized. It works by adding pseudo-random noise that is tailored to the query as well as the underlying dataset. It also includes other protection mechanisms such as outlier suppression and low-count filtering.
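To make two of these protection mechanisms more concrete, here is a minimal Python sketch of low-count filtering and outlier suppression. It is an illustration under stated assumptions, not Aircloak's actual implementation: the function names, the threshold of 5 users, and the rule of replacing the largest value with the mean of the next two are all hypothetical choices made for this example.

```python
from collections import defaultdict

LOW_COUNT_THRESHOLD = 5  # hypothetical threshold, chosen for illustration


def low_count_filter(rows, dimension):
    """Count distinct users per value of `dimension` and drop small groups."""
    users = defaultdict(set)
    for row in rows:
        users[row[dimension]].add(row["user_id"])
    return {
        value: len(ids)
        for value, ids in users.items()
        if len(ids) >= LOW_COUNT_THRESHOLD
    }


def suppress_outliers(values, top_n=1, reference_n=2):
    """Replace the `top_n` largest values with the mean of the next `reference_n`."""
    ordered = sorted(values, reverse=True)
    if len(ordered) < top_n + reference_n:
        return []  # too few values to hide an extreme one
    reference = ordered[top_n:top_n + reference_n]
    cap = sum(reference) / len(reference)
    return [cap] * top_n + ordered[top_n:]


rows = (
    [{"user_id": i, "city": "Berlin"} for i in range(6)]
    + [{"user_id": 100 + i, "city": "Bonn"} for i in range(2)]
)
print(low_count_filter(rows, "city"))               # {'Berlin': 6}
print(sum(suppress_outliers([1000, 50, 40, 30])))   # 165.0
```

In this sketch the two-user group never reaches the analyst, and a single extreme value can no longer dominate a sum.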
The technology was developed in collaboration with the Max Planck Institute for Software Systems. It is the culmination of years of research in the field of data privacy and anonymization.
As we saw in the Anonymization section, there are two approaches to anonymization. The first approach is to completely anonymize the dataset before you query it. This gives strong anonymity but carries a risk of degrading your results so much that they are useless. The other approach is dynamic anonymization, where you anonymize the data on a query-by-query basis.
The Aircloak approach is dynamic, but it has been designed to ensure privacy in all the query functions. It uses custom implementations of standard analytics primitives such as AVERAGE, SUM and COUNT that are anonymization-aware by design. These take account of the scope of the query and ignore irrelevant information from the dataset when anonymizing the data. The anonymized results are then returned to the analyst. Because the anonymization runs only on the relevant data, it only needs to take into account those dimensions that are part of the query itself. This allows fine-grained anonymization while still removing the risks of data exposure.
How Does This Approach Give GDPR Compliance?
Dynamic anonymization works by marginally altering the values produced by the query. You can alter either the dimensions of the data or, more often, the aggregate values. Generally, aggregates are altered by adding small amounts of random noise (Gaussian or some other random noise source). The problem is that truly random noise can be averaged away by repeating the same query many times, which is why approaches like Differential Privacy introduce a privacy budget that limits how many queries can be answered. Once you have exhausted the privacy budget, that dataset can no longer be used for analysis.
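A short Python simulation shows why fresh random noise on every query is a problem. The numbers here (a true count of 42 and Gaussian noise with a standard deviation of 2) are made up for illustration. If an analyst repeats the same query and averages the answers, the error of the average shrinks roughly with the square root of the number of repetitions, so the true value gradually leaks.

```python
import random
import statistics


def noisy_count(true_count, sd, rng):
    """Return the true count plus fresh Gaussian noise."""
    return true_count + rng.gauss(0, sd)


def averaging_attack(true_count, sd, repeats, seed=0):
    """Repeat the same query `repeats` times and average the noisy answers."""
    rng = random.Random(seed)  # fixed seed only to make the demo repeatable
    answers = [noisy_count(true_count, sd, rng) for _ in range(repeats)]
    return statistics.mean(answers)


for repeats in (1, 100, 10_000):
    estimate = averaging_attack(true_count=42, sd=2.0, repeats=repeats)
    print(repeats, round(estimate, 2))
```

With more repetitions the printed estimate settles close to 42. A privacy budget exists to cut this process off before the estimate becomes too precise.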
Aircloak takes a different approach. Rather than applying truly random noise, it uses pseudo-random layers of noise. This means that every time a query is run, the noise added to it is the same, so you can run the query as often as you want without increasing the risk of leaking the original data. Not only that, but queries that are written differently yet are semantically equivalent also return the same result with the same noise applied. This removes the risk of revealing the data by repeatedly querying it in different ways. This means that the data will remain anonymous and hence this approach is GDPR compliant.
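The following Python sketch illustrates the principle of repeatable, layered noise. It is a simplified model and not Aircloak's implementation: the secret salt, the `sticky_count` function, the idea of one noise layer per filter condition, and the very crude normalisation (collapsing whitespace and sorting the conditions) are assumptions for this example. A real system would need a far more thorough notion of semantic equivalence.

```python
import hashlib
import random

SECRET_SALT = b"hypothetical-secret"  # assumed to be known only to the anonymizer


def _layer(condition, sd):
    """One noise layer, fully determined by the salt and the condition text."""
    digest = hashlib.sha256(SECRET_SALT + condition.encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return rng.gauss(0, sd)


def normalise(conditions):
    """Collapse whitespace and sort, so ordering and spacing do not matter."""
    return sorted(" ".join(c.split()) for c in conditions)


def sticky_count(true_count, conditions, sd=1.0):
    """Add one pseudo-random noise layer per normalised filter condition."""
    noise = sum(_layer(c, sd) for c in normalise(conditions))
    return round(true_count + noise)


first = sticky_count(120, ["age > 30", "city = 'Berlin'"])
again = sticky_count(120, ["age > 30", "city = 'Berlin'"])
reordered = sticky_count(120, ["city = 'Berlin'", "age   >  30"])
print(first == again == reordered)  # True
```

Because the noise depends only on the query conditions and a secret, repeating or rephrasing the query gives the analyst nothing new to average over.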
What Other Benefits Does This Approach Bring?
This new approach to anonymization should help you streamline the process of running analytics. Chief among the benefits is the fact that any suitable team member can run analytics over the whole dataset without the need for specific per-use-case authorizations. This also applies to suitably vetted external contractors or business divisions that are usually separated by Chinese Walls.
This means all organisations that store valuable data, such as health, location, or financial data, can now instantly and safely monetise the data regardless of its format.
Finally, the provision of full audit trails reduces the cost of mandatory reporting and ensures compliance with the relevant parts of GDPR.