The Power of Anonymisation
Since GDPR only relates to personal data, any data that is not personal is not covered by the regulation. This means that if you are able to completely remove any personal identifiers from the data, that data is no longer subject to the rules.
This is where anonymisation comes in!
Data anonymisation is the process of taking the personal data and modifying it in such a way that it can no longer be used to identify an individual. As a result, data that has been properly anonymised can be freely used, shared and transferred without being covered by GDPR or any other legislation.
However, true data anonymisation is difficult to achieve, which means many data controllers fail to do it properly and completely.
Furthermore, proper anonymisation can destroy the data to such an extent that it no longer has any value. This is due to the requirement to ensure that you aren’t revealing data that can be used in combination with other details to identify an individual.
Understanding how inference can be used to identify individuals from data is difficult and hence many data controllers take a conservative approach and assume almost all the data will need to be changed.
Pseudonymisation vs. Anonymisation
Because anonymisation is so hard, people often try to use pseudonymisation to protect personal data. Pseudonymisation is the process of directly replacing personal identifiers with some form of random identifier. For instance, you might replace names with random sequences of letters and phone numbers with random digits. The trouble is that although this may give the illusion of protecting personal data, it doesn’t give strong enough protection against inference attacks.
Inference attacks work by combining data points in order to re-identify an individual with high probability. As an example, take a dataset containing health records. You might pseudonymise the data by replacing all the names and healthcare IDs with unique random numbers. However, an attacker may be able to use a combination of other data such as location, medical history, gender and age to make an accurate guess as to the real identity of the person.
In the case of pseudonymisation, the direct identifiers in a data record are replaced with pseudonyms or deleted – for example, a telephone number could be exchanged for random digits, or a user ID could be stored instead of a plain name. This type of processing preserves much of the value of the data, but it is not nearly as secure as anonymisation. Therefore, pseudonymous data continues to be considered personal data within the meaning of the GDPR.
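To make this concrete, here is a minimal pseudonymisation sketch in Python. The record fields and values are made up for illustration; the key point is that only the direct identifiers (name, phone number) are replaced, while everything else – the quasi-identifiers an inference attack would exploit – is left untouched.

```python
import secrets

# Illustrative records; field names and values are hypothetical.
records = [
    {"name": "Alice Martin", "phone": "+49 170 1234567", "age": 34, "city": "Berlin"},
    {"name": "Bob Keller",   "phone": "+49 171 7654321", "age": 58, "city": "Munich"},
]

# The mapping from real names to pseudonyms is kept separately;
# whoever holds it (or steals it) can trivially re-identify everyone.
pseudonym_map = {}

def pseudonymise(record):
    name = record["name"]
    if name not in pseudonym_map:
        pseudonym_map[name] = "user-" + secrets.token_hex(4)
    return {
        "user_id": pseudonym_map[name],  # name -> random token
        "phone": "".join(secrets.choice("0123456789") for _ in range(10)),
        "age": record["age"],            # quasi-identifiers survive untouched
        "city": record["city"],
    }

pseudo = [pseudonymise(r) for r in records]
```

Note that `age` and `city` pass straight through – exactly the kind of residual detail that, combined with external data, enables re-identification.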
With anonymisation, data is changed in such a way that identifying a natural person is either impossible or possible only with disproportionately high effort. Anonymity is achieved, for example, by aggregating data points into groups or by adding noise – i.e. deliberately incorrect data – to a dataset. Anonymised data is not subject to data protection laws.
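A rough sketch of both techniques mentioned above – aggregation into groups and added noise – might look like this in Python. The ages and the noise range are arbitrary illustrative choices, not a recommended mechanism.

```python
import random

# Hypothetical individual ages.
ages = [23, 27, 31, 34, 38, 41, 44, 52, 58, 61]

def decade(age):
    """Aggregate an exact age into a decade group, e.g. 34 -> '30-39'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

# Step 1: aggregation - individual ages disappear into group counts.
groups = {}
for age in ages:
    groups[decade(age)] = groups.get(decade(age), 0) + 1

# Step 2: noise - perturb each group count with small random error,
# so a count of 1 no longer reliably pinpoints a single person.
noisy_groups = {g: max(0, c + random.randint(-1, 1)) for g, c in groups.items()}
```

After both steps, no record corresponds to an individual any more – but the counts are now only approximately correct, which is the utility cost the surrounding text describes.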
Why Pseudonymisation Isn’t Enough
GDPR Recital 26 makes it clear that pseudonymisation is not strong enough to protect personal data. Pseudonymisation permits data controllers to handle their data more liberally, but it does not eliminate the risk of re-identification.
Recital 26 states that
“The principles of data protection should apply to any information concerning an identified or identifiable natural person. Personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person.”
Crucially, this means pseudonymous data is still subject to privacy regulations under GDPR.
Recital 26 then distinguishes explicitly between pseudonymised and anonymised data. It explains that pseudonymised data
“…should be considered to be information on an identifiable natural person.” It then states: “The principles of data protection should […] not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.”
And it makes it clear:
“This Regulation does not therefore concern the processing of such anonymous information, including for statistical or research purposes.”
In short, if your data is anonymised it is no longer subject to the GDPR, while if it is simply pseudonymised GDPR still applies.
Anonymisation can generally take one of two approaches. In the first approach, called “static anonymisation”, you completely anonymise all the data before you query it. This is the traditional approach adopted by many companies.
The other approach is dynamic (or sometimes called interactive) anonymisation. Here the data is only anonymised as part of the query process. This has the benefit that the analysis will be more accurate/useful, but it is much harder to achieve.
K-Anonymity – A Concept That Is Already Over 20 Years Old
Traditional approaches include k-anonymity. The concept of k-anonymity provides a way to measure the anonymity of data. k is defined such that each person cannot be distinguished from at least k-1 others. In other words, in a k=5 anonymised dataset, no person can be distinguished from at least 4 others.
This is achieved by generalising the data – replacing exact values with ranges, and so on – until the required level of anonymity is reached. The trouble is that producing optimally k-anonymised datasets is known to be a computationally hard (NP-hard) problem. Moreover, even optimal k-anonymisation doesn’t guarantee anonymity, since the data can still be cross-referenced with other publicly available datasets to enable re-identification. In other words, it is fundamentally flawed as an approach.
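The definition above translates directly into a measurement: group records by their (generalised) quasi-identifiers and take the size of the smallest group. A minimal sketch, using made-up records that have already been generalised into ranges and truncated postcodes:

```python
from collections import Counter

# Hypothetical records, already generalised:
# (age range, truncated postcode, gender)
records = [
    ("20-29", "SW1*", "F"),
    ("20-29", "SW1*", "F"),
    ("30-39", "NW3*", "M"),
    ("30-39", "NW3*", "M"),
    ("30-39", "NW3*", "M"),
]

def k_anonymity(rows):
    """k is the size of the smallest equivalence class of quasi-identifiers."""
    counts = Counter(rows)
    return min(counts.values())

print(k_anonymity(records))  # -> 2: every record shares its quasi-identifiers with at least 1 other
```

The anonymiser's job is to keep generalising (widening the ranges, truncating further) until this number reaches the target k – which is exactly where the utility loss comes from.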
Differential Privacy – New Perspectives on Privacy
One of the best-known dynamic approaches is differential privacy. Here, random noise is added to the results of the query to ensure that it isn’t possible to differentiate one user from another (hence “differential”). The maths behind this is quite complex, and achieving true differential privacy is hard, so companies often adopt relaxed variants such as random differential privacy.
Another common misconception is that differential privacy is an algorithm.
Rather, it is a model of privacy.
Essentially, the model quantifies the statistical uncertainty about whether a given user is present in a database at all. The idea is that if you cannot be sure whether someone is in a database, that person’s privacy is protected. Differential privacy assigns the value epsilon to this uncertainty. A low epsilon (e.g. less than 1) guarantees high uncertainty. Higher values of epsilon are less clear-cut: there may or may not be uncertainty, depending on other factors such as any external knowledge an analyst has about a person in the database. As described above, mechanisms that provide differential privacy typically add noise – either to the responses to queries (dynamic) or to the records themselves (static).
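The classic mechanism for the dynamic case is the Laplace mechanism: add Laplace-distributed noise, scaled by the query's sensitivity divided by epsilon, to each query answer. A minimal sketch for a counting query (whose sensitivity is 1, since one person changes a count by at most 1):

```python
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale): the difference of two i.i.d. exponentials."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def dp_count(true_count, epsilon):
    """Answer a counting query with epsilon-differential privacy.

    A count changes by at most 1 when one person joins or leaves the
    database (sensitivity 1), so Laplace noise of scale 1/epsilon suffices.
    """
    return true_count + laplace_noise(1.0 / epsilon)

# Small epsilon -> strong privacy but a noisy answer;
# large epsilon -> accurate answer but weak privacy.
noisy_strict = dp_count(1000, epsilon=0.1)   # typically off by tens
noisy_loose  = dp_count(1000, epsilon=10.0)  # typically off by a fraction
```

This makes the epsilon trade-off described above tangible: the noise scale is inversely proportional to epsilon, so the privacy guarantee and the accuracy of the answer pull directly against each other.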
Just like the static anonymisation method k-anonymity, differential privacy comes at the cost of data utility: a mechanism with a small epsilon removes almost all of the information value. Moreover, mathematically proving that a mechanism is actually differentially private requires extensive expertise. It is therefore no coincidence that only companies with very large research budgets use it with a published epsilon.
Privacy Budget: A Limit to Insights
Mechanisms like differential privacy provide strong anonymity, but they are hard to implement and suffer from the issue of “privacy budget”.
The problem is that when you use truly random noise to anonymise your data, every time you query the same data, you reduce the level of anonymisation. This is because you are able to use the aggregate results to reconstruct the original data by filtering out the noise. In differential privacy, the value epsilon is used to determine how strict the privacy is. The smaller the value, the better the privacy but the worse the accuracy of any results from analysing the data. Also, the smaller the value of epsilon, the fewer times you can access the data (effectively epsilon is proportional to your privacy budget).
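The "filtering out the noise" attack is easy to demonstrate. In this sketch, each individual answer is protected with a very strict epsilon of 0.1, yet averaging many answers to the same query recovers the true value almost exactly (the mechanism and parameters are illustrative):

```python
import random
import statistics

def noisy_count(true_count, epsilon):
    """Epsilon-DP count; Laplace noise as a difference of two exponentials."""
    scale = 1.0 / epsilon
    return true_count + random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

true_value = 500

# Each single answer is strongly protected (epsilon = 0.1) and very noisy...
answers = [noisy_count(true_value, epsilon=0.1) for _ in range(10_000)]

# ...but averaging repeated answers to the SAME query filters the noise out.
estimate = statistics.mean(answers)  # converges on the true value

# Under sequential composition, the queries together have consumed
# 10_000 * 0.1 = 1_000 epsilon of privacy budget - no meaningful privacy left.
```

This is why a privacy budget is unavoidable: the protection of each answer composes, and once the accumulated epsilon is large, the noise no longer hides anything.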
The original authors of differential privacy made the assumption that epsilon would always be set to a value “much less than one.” However, a 2018 paper looking at the issues with deploying differential privacy in the US Census Bureau discovered that practitioners were picking values of epsilon sufficient to give the required accuracy and then were tripling these.
This was to ensure they could repeatedly sample the same dataset without needing to re-anonymise it, but at the same time it radically reduced the strength of the anonymisation.
To retain data utility, data controllers typically combine a complex variety of mechanisms that all provide some anonymity but may not protect against re-identification in all cases.
These include rounding, cell swapping, outlier removal, aggregation, sampling, and others. Getting this right requires substantial expertise and again, if done completely right, this process typically destroys the utility of the data.
The problem comes down to the fact that the same data may count as personal in one context and not in another, depending on the nature of the dataset. If you take a conservative approach that guarantees anonymity in all cases, you will very likely end up with no usable data. But if you take a more nuanced approach that preserves useful data, you won’t be able to demonstrate that you have preserved anonymity.
Essentially, this means that each time you run an analysis you need to re-assess whether the data released is suitably anonymised. Not only does this become extremely time consuming, but it also carries a good deal of bureaucracy as you have to be able to satisfy the data controller and the competent data authority that you are doing it correctly.
The problems outlined above have given rise to some myths about data anonymisation.
Myth #1: Anonymisation Destroys Data
The biggest of these is that “Anonymisation will always destroy my data”. This myth has arisen because of the tendency for people to take the conservative approach of applying anonymisation aggressively and uniformly across all the data. As already pointed out, doing anonymisation correctly is a hard problem (indeed, in some cases it may be an insoluble problem). But that does not mean that anonymisation has to destroy your data.
Myth #2: Pseudonymisation = Anonymisation
Another commonly held belief is that pseudonymisation is synonymous with anonymisation. This is partly because in many jurisdictions, anonymisation is not clearly defined, or is defined in such a way that it encompasses pseudonymisation. However, in the EU, anonymisation and pseudonymisation are clearly distinguished. Anonymised data is exempted from GDPR but pseudonymised data is not.
Myth #3: Anonymisation Is Not Possible
The final common myth is that anonymisation is simply not possible. This arises from the fact that people know the issues that anonymisation techniques such as k-anonymity have with preventing re-identification. Consequently they assume this means it is never possible to fully anonymise data.
One issue some analysts have is understanding how to use anonymised data in their analysis. It can be easy to fall into the trap of thinking that because the data is anonymised, it will be harder to work with. However, in practice you can often use anonymised data in exactly the same way you use raw data. (If you want to know more about that we also recommend our blog post ‘Can anonymised data still be useful?’)
Let’s look at a simple example. Imagine you work for a bank and have been asked to generate a simple analysis. This will show the link between customer deposits, demographics and location. More specifically, the bank wants to know how many of its customers under 35 who bank in a certain branch deposit at least €2,500 a month. The data should also include gender and how much money the customers have in their account. This data will then be used for targeted marketing in that branch.
Using the raw data to generate this would be simple. Find all customers in that branch. Then find the ones that are under 35. Then remove those that don’t reach the required monthly deposit level. Finally use the data about gender and bank balance to generate the resulting plot.
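The raw-data steps above are just a chain of filters. A sketch with hypothetical field names and values (the real schema would of course differ):

```python
# Hypothetical customer records; field names and values are illustrative.
customers = [
    {"branch": "B001", "age": 29, "gender": "F", "monthly_deposit": 3200, "balance": 12000},
    {"branch": "B001", "age": 41, "gender": "M", "monthly_deposit": 2800, "balance": 54000},
    {"branch": "B002", "age": 33, "gender": "M", "monthly_deposit": 2600, "balance": 9000},
    {"branch": "B001", "age": 27, "gender": "M", "monthly_deposit": 2400, "balance": 7000},
]

# Branch, then age, then deposit threshold - exactly the steps in the text.
target = [
    c for c in customers
    if c["branch"] == "B001" and c["age"] < 35 and c["monthly_deposit"] >= 2500
]

# 'target' now carries the gender and balance fields needed for the plot.
```

With raw data every filter is exact, which is precisely what anonymisation will complicate.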
With anonymised data the process is similar. However, you have to be aware of how the anonymisation has been done.
For instance, if the customer ages have been binned into decades (<20, 20-30, 30-40, etc.), you won’t be able to identify all the customers under 35, or you will end up including some customers aged 35-40. Equally, if it turns out that only a handful of the customers are male, then you won’t be able to see the gender vs. account balance results for males.
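With binned ages, the "under 35" filter can only be approximated, and the analyst has to choose which error to accept. A sketch under the same hypothetical schema, with ages replaced by decade bands:

```python
# Same hypothetical schema, but ages have been binned into decade bands.
binned = [
    {"branch": "B001", "age_band": "20-29", "gender": "F", "monthly_deposit": 3200, "balance": 12000},
    {"branch": "B001", "age_band": "30-39", "gender": "M", "monthly_deposit": 2800, "balance": 54000},
]

# Conservative reading: keep only bands that lie entirely below 35,
# at the cost of dropping all genuine 30-34 year olds.
strictly_under_35 = [c for c in binned if c["age_band"] in ("<20", "20-29")]

# Inclusive reading: also take the 30-39 band,
# at the cost of including some 35-39 year olds.
at_most_under_40 = strictly_under_35 + [c for c in binned if c["age_band"] == "30-39"]
```

Neither list matches the original "under 35" population exactly; knowing how the binning was done tells you which approximation you are getting.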
Building a Privacy-Preserving Analytics Stack – better understand how to comply with the requirements imposed by GDPR while still leveraging data analysis.
GDPR is one of the most far-reaching data protection laws anywhere in the world, and as such it has had a huge impact globally. This is because, unlike many national data protection laws, GDPR applies to any company that deals with EU residents, wherever they are in the world. In this section we will look at the specific impact GDPR has had on analytics.
We also give you some advice on how to select tools for your stack. As with most things, the right tool for one setting won’t be right in another, so this advice covers the things you should consider when selecting a particular tool.