The Power of Anonymization
Since GDPR only relates to personal data, any data that is not personal is not covered by the regulation. This means that if you are able to completely remove any personal identifiers from the data, that data is no longer subject to the rules.
This is where anonymization comes in!
Data anonymization is the process of taking personal data and modifying it in such a way that it can no longer be used to identify an individual. As a result, data that has been properly anonymized can be freely used, shared and transferred without being covered by GDPR or similar data protection legislation.
However, true data anonymization is difficult to achieve, which means many data controllers fail to do it properly and completely.
Furthermore, proper anonymization can destroy the data to such an extent that it no longer has any value. This is due to the requirement to ensure that you aren’t revealing data that can be used in combination with other details to identify an individual.
Understanding how inference can be used to identify individuals from data is difficult and hence many data controllers take a conservative approach and assume almost all the data will need to be changed.
Pseudonymization vs. Anonymization
Because anonymization is so hard, people often try to use pseudonymization to protect personal data. Pseudonymization is the process of directly replacing personal identifiers with some form of random identifier. For instance, you might replace names with random sequences of letters and phone numbers with random numbers. The trouble is that although this may give the illusion of protecting personal data, it doesn’t give strong enough protection against inference attacks.
Inference attacks work by combining data points in order to re-identify an individual with high probability of accuracy. As an example, take a dataset containing health records. You might pseudonymize the data by replacing all the names and healthcare IDs with unique random numbers. However, an attacker may be able to use a combination of other data such as location, medical history, gender and age to make an accurate guess as to the real identity of the person.
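The health-record scenario above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the `public_register` the attacker is assumed to hold, and the choice of postcode, birth year and gender as quasi-identifiers are all hypothetical.

```python
# Hypothetical pseudonymized health records: names are replaced by random IDs,
# but quasi-identifiers (postcode, birth year, gender) are left intact.
pseudonymized = [
    {"pid": "a91f", "postcode": "10115", "birth_year": 1984, "gender": "F", "diagnosis": "asthma"},
    {"pid": "7c2e", "postcode": "10115", "birth_year": 1990, "gender": "M", "diagnosis": "diabetes"},
    {"pid": "d403", "postcode": "80331", "birth_year": 1984, "gender": "F", "diagnosis": "migraine"},
]

# Hypothetical auxiliary data the attacker already holds.
public_register = [
    {"name": "Alice Example", "postcode": "10115", "birth_year": 1984, "gender": "F"},
    {"name": "Bob Example", "postcode": "10115", "birth_year": 1990, "gender": "M"},
    {"name": "Carol Example", "postcode": "80331", "birth_year": 1984, "gender": "F"},
    {"name": "Dana Example", "postcode": "80331", "birth_year": 1984, "gender": "F"},
]

QUASI_IDENTIFIERS = ("postcode", "birth_year", "gender")


def key(record):
    """Build the combination of quasi-identifier values used for linking."""
    return tuple(record[q] for q in QUASI_IDENTIFIERS)


def link(records, register):
    """Return {pid: name} for records whose quasi-identifiers match exactly one person."""
    candidates = {}
    for person in register:
        candidates.setdefault(key(person), []).append(person["name"])
    matches = {}
    for rec in records:
        names = candidates.get(key(rec), [])
        if len(names) == 1:  # a unique match means the pseudonym is broken
            matches[rec["pid"]] = names[0]
    return matches


reidentified = link(pseudonymized, public_register)
print(reidentified)
```

In this toy data two of the three pseudonyms link to exactly one named person, while the third stays ambiguous because two people in the register share the same combination of attributes.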
In the case of pseudonymization, the direct identifiers in a data record are deleted or replaced with pseudonyms. For example, a telephone number could be exchanged for random digits, or a user ID could be stored instead of a plain name. This type of processing preserves much of the value of the data, but is not nearly as secure as anonymization. Therefore, pseudonymous data continues to be considered personal data within the meaning of the GDPR.
With anonymization, data is changed in such a way that drawing conclusions about a natural person is either impossible or possible only with disproportionate effort. Anonymity is achieved, for example, by aggregating data points into groups or by adding noise (i.e. deliberately incorrect data) to a data set. Anonymized data is not subject to data protection laws.
Why Pseudonymization Isn’t Enough
GDPR Recital 26 makes it clear that pseudonymization is not strong enough to protect personal data. Pseudonymization permits data controllers to handle their data more liberally, but it does not abolish all risks due to the possibility of re-identification.
Recital 26 states that
“The principles of data protection should apply to any information concerning an identified or identifiable natural person. Personal data which have undergone pseudonymization, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person.”
Crucially, this means pseudonymous data is still subject to privacy regulations under GDPR.
Recital 26 then distinguishes explicitly between pseudonymized and anonymized data. It explains that pseudonymized data
“…should be considered to be information on an identifiable natural person.” It then states: “The principles of data protection should […] not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.”
And it makes it clear:
“This Regulation does not therefore concern the processing of such anonymous information, including for statistical or research purposes.”
In short, if your data is anonymized it is no longer subject to the GDPR, while if it is simply pseudonymized GDPR still applies.
Anonymization can generally take one of two approaches. In the first approach, called “static anonymization”, you completely anonymize all the data before you query it. This approach is the traditional approach adopted by many companies.
The other approach is dynamic (sometimes called interactive) anonymization. Here the data is only anonymized as part of the query process. This has the benefit that the analysis will be more accurate and useful, but it is much harder to achieve.
K-Anonymity – A Concept That Is Already Over 20 Years Old
Traditional approaches include k-anonymity. The concept of k-anonymity provides a way to measure the anonymity of data. A dataset is k-anonymous if each person in it cannot be distinguished from at least k-1 others. In other words, in a k=5 anonymized dataset, it is impossible to distinguish each person from at least 4 others.
This is achieved by generalising data and replacing it with ranges, etc., until the required level of anonymity is reached. The trouble is that producing optimally k-anonymized datasets is known to be a computationally hard (NP-hard) problem. Moreover, k-anonymity doesn’t even guarantee anonymity, since the data can still be cross-referenced with other publicly available datasets to enable re-identification. In other words, it is fundamentally flawed as an approach.
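As a rough sketch of how k can be measured and raised by generalisation, the Python below counts the smallest group of rows sharing the same quasi-identifier values. The rows, the column names and the helper functions `k_anonymity` and `generalise_age` are hypothetical and only serve the illustration.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return k: the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


def generalise_age(age, width=10):
    """Replace an exact age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Hypothetical raw rows: every (age, postcode) combination is unique.
rows = [
    {"age": 31, "postcode": "10115"},
    {"age": 34, "postcode": "10117"},
    {"age": 38, "postcode": "10119"},
    {"age": 42, "postcode": "10115"},
    {"age": 47, "postcode": "10117"},
]

k_raw = k_anonymity(rows, ["age", "postcode"])

# Generalise: ages become decade ranges, postcodes keep only their first three digits.
generalised = [
    {"age": generalise_age(r["age"]), "postcode": r["postcode"][:3] + "**"}
    for r in rows
]
k_generalised = k_anonymity(generalised, ["age", "postcode"])

print(k_raw, k_generalised)
```

Generalising raises k here, but only by discarding detail, and the measure says nothing about what an attacker can learn by joining the result with other datasets.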
Differential Privacy – New Perspectives on Privacy
One of the best-known dynamic approaches is differential privacy. Here, random noise is added to the results of the query to ensure that it isn’t possible to differentiate one user from another (hence “differential”). The maths behind this is quite complex, and to achieve true differential privacy is hard, hence companies often adopt more relaxed forms such as randomised differential privacy.
A common misconception is that differential privacy is an algorithm.
Rather, it is a model of privacy.
Essentially, the model creates statistical uncertainty about whether a user exists in a database or not. The idea is that if you cannot be sure whether someone is in a database, the privacy of that person is protected. Differential Privacy uses the value epsilon to quantify this uncertainty. A low epsilon (e.g. less than 1) means that there is a high level of uncertainty. However, higher values of epsilon are less clear: there may or may not be uncertainty depending on other factors, such as external knowledge an analyst has about a person in the database. As described above, mechanisms that use Differential Privacy typically add noise, either to the responses to queries (dynamic) or to the records themselves (static).
Just like the static anonymization method K-Anonymity, Differential Privacy comes at the expense of the usefulness of the data. A mechanism with a small epsilon removes almost all of the information value. Moreover, proving mathematically that a mechanism is actually differentially private requires extensive expertise. Therefore, it is no coincidence that only companies with very high research expenditures use it with a published epsilon.
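One widely used way to add such noise to a query response is the Laplace mechanism. The sketch below applies it to a simple counting query, assuming a sensitivity of 1 (adding or removing one person changes the count by at most 1). The data and function names are hypothetical, and a floating-point implementation like this is not hardened for production use.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(values, predicate, epsilon, rng):
    """Answer a counting query with Laplace noise of scale sensitivity / epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    sensitivity = 1  # assumption: one person changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
ages = [23, 35, 41, 29, 52, 38, 31, 27]  # hypothetical data

# Small epsilon: noise scale 10, strong uncertainty, little accuracy.
strict = noisy_count(ages, lambda a: a < 35, epsilon=0.1, rng=rng)
# Large epsilon: noise scale 0.2, accurate answer, weak uncertainty.
loose = noisy_count(ages, lambda a: a < 35, epsilon=5.0, rng=rng)

print(strict, loose)
```

The noise scale is inversely proportional to epsilon, which is the trade-off between uncertainty and accuracy that epsilon controls.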
Privacy Budget: A Limit to Insights
Mechanisms like differential privacy provide strong anonymity, but they are hard to implement and suffer from the issue of “privacy budget”.
The problem is that when you use truly random noise to anonymize your data, every time you query the same data, you reduce the level of anonymization. This is because you are able to use the aggregate results to reconstruct the original data by filtering out the noise. In differential privacy, the value epsilon is used to determine how strict the privacy is. The smaller the value, the better the privacy but the worse the accuracy of any results from analysing the data. Also, the smaller the value of epsilon, the fewer times you can access the data (effectively epsilon is proportional to your privacy budget).
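The effect of repeated queries can be illustrated with a small simulation. It assumes Laplace noise and basic sequential composition, under which the epsilons of successive queries add up. The `PrivacyBudget` class, the true count and the budget of 50 are hypothetical values chosen for the sketch.

```python
import random


class PrivacyBudget:
    """Tracks cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        # Refuse the query once the total budget would be exceeded.
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_answer(true_value, epsilon, budget, rng):
    """Charge the budget, then return the true value plus Laplace noise."""
    budget.charge(epsilon)
    return true_value + laplace_noise(1 / epsilon, rng)


rng = random.Random(42)  # fixed seed only to make the sketch repeatable
TRUE_COUNT = 120  # hypothetical value the noise is meant to hide
budget = PrivacyBudget(total_epsilon=50.0)

# One query at epsilon = 0.1 is very noisy (noise scale 10).
single = noisy_answer(TRUE_COUNT, 0.1, budget, rng)

# Repeating the same query 400 times lets an analyst average the noise away.
repeated = [noisy_answer(TRUE_COUNT, 0.1, budget, rng) for _ in range(400)]
averaged = sum(repeated) / len(repeated)

print(single, averaged, budget.spent)
```

Each individual answer is noisy, but the average of many answers converges on the true value, which is why every query has to be charged against a finite budget.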
The original authors of differential privacy made the assumption that epsilon would always be set to a value “much less than one.” However, a 2018 paper looking at the issues with deploying differential privacy in the US Census Bureau discovered that practitioners were picking values of epsilon sufficient to give the required accuracy and then were tripling these.
This was to ensure they could repeatedly sample the same data set without the need to re-anonymize it, but at the same time it radically reduced the strength of the anonymization.
To retain data utility, data controllers typically combine a complex variety of mechanisms that all provide some anonymity but may not protect against re-identification in all cases.
These include rounding, cell swapping, outlier removal, aggregation, sampling, and others. Getting this right requires substantial expertise and again, if done completely right, this process typically destroys the utility of the data.
The problem comes down to the fact that data sometimes may count as personal and sometimes doesn’t, depending on the nature of the dataset. If you take a conservative approach that guarantees anonymity in all cases, you very likely will end up with no usable data. But if you take a more nuanced approach that preserves useful data you won’t be able to demonstrate that you have preserved anonymity.
Essentially, this means that each time you run an analysis you need to re-assess whether the data released is suitably anonymized. Not only does this become extremely time consuming, but it also carries a good deal of bureaucracy as you have to be able to satisfy the data controller and the competent data authority that you are doing it correctly.
The problems outlined above have given rise to some myths about data anonymization.
Myth #1: Anonymization Destroys Data
The biggest of these is that “Anonymization will always destroy my data”. This myth has arisen because of the tendency for people to take the conservative approach of applying anonymization aggressively and uniformly across all the data. As already pointed out, doing anonymization correctly is a hard problem (indeed, in some cases it may be an insoluble problem). But that does not mean that anonymization has to destroy your data.
Myth #2: Pseudonymization = Anonymization
Another commonly held belief is that pseudonymization is synonymous with anonymization. This is partly because in many jurisdictions, anonymization is not clearly defined, or is defined in such a way that it encompasses pseudonymization. However, in the EU, anonymization and pseudonymization are clearly distinguished. Anonymized data is exempted from GDPR but pseudonymized data is not.
Myth #3: Anonymization Is Not Possible
The final common myth is that anonymization is simply not possible. This arises from the fact that people know the issues that anonymization techniques such as k-anonymity have with preventing re-identification. Consequently they assume this means it is never possible to fully anonymize data.
One issue some analysts have is understanding how to use anonymized data in their analysis. It can be easy to fall into the trap of thinking that because the data is anonymized, it will be harder to work with. However, in practice you can often use anonymized data in exactly the same way you use raw data. (If you want to know more about that we also recommend our blog post ‘Can anonymized data still be useful?’)
Let’s look at a simple example. Imagine you work for a bank and have been asked to generate a simple analysis. This will show the link between customer deposits, demographics and location. More specifically, the bank wants to know how many of its customers under 35 who bank in a certain branch deposit at least €2,500 a month. The data should also include gender and how much money the customers have in their account. This data will then be used for targeted marketing in that branch.
Using the raw data to generate this would be simple. Find all customers in that branch. Then find the ones that are under 35. Then remove those that don’t reach the required monthly deposit level. Finally use the data about gender and bank balance to generate the resulting plot.
With anonymized data the process is similar. However, you have to be aware of how the anonymization has been done.
For instance, if the customer ages have been binned into decades (<20, 20-30, 30-40, etc.), you won’t be able to identify all the customers under 35, or you will end up including some customers aged 35-40. Equally, if it turns out that only a handful of the customers are male, then you won’t be able to see the gender vs. account balance results for males.
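The bank example can be sketched as follows. Everything here is an assumption made for illustration: the customer rows, the decade bins stored as (low, high) pairs with the upper bound excluded, and a minimum group size of 3 below which a result is suppressed.

```python
# Hypothetical anonymized extract for one branch: exact ages replaced by decade bins.
customers = [
    {"age_bin": (20, 30), "gender": "F", "monthly_deposit": 3000, "balance": 12000},
    {"age_bin": (20, 30), "gender": "F", "monthly_deposit": 2600, "balance": 8000},
    {"age_bin": (20, 30), "gender": "F", "monthly_deposit": 4000, "balance": 10000},
    {"age_bin": (20, 30), "gender": "M", "monthly_deposit": 2800, "balance": 15000},
    {"age_bin": (30, 40), "gender": "F", "monthly_deposit": 5000, "balance": 30000},
    {"age_bin": (0, 20), "gender": "M", "monthly_deposit": 500, "balance": 900},
    {"age_bin": (20, 30), "gender": "F", "monthly_deposit": 1000, "balance": 2000},
]

MIN_GROUP_SIZE = 3  # hypothetical threshold below which results are withheld


def select_target_group(rows, max_age, min_deposit):
    """Keep rows whose entire age bin lies below max_age and whose deposit is high enough."""
    return [
        row for row in rows
        if row["age_bin"][1] <= max_age and row["monthly_deposit"] >= min_deposit
    ]


def average_balance_by_gender(rows, min_group_size):
    """Average balance per gender; groups that are too small are reported as None."""
    balances = {}
    for row in rows:
        balances.setdefault(row["gender"], []).append(row["balance"])
    return {
        gender: (sum(vals) / len(vals) if len(vals) >= min_group_size else None)
        for gender, vals in balances.items()
    }


# "Under 35" can only be approximated: the 30-40 bin is dropped entirely.
target = select_target_group(customers, max_age=35, min_deposit=2500)
result = average_balance_by_gender(target, MIN_GROUP_SIZE)
print(result)
```

The sketch takes the conservative option and drops the whole 30-40 bin, so customers aged 30 to 34 are missed. Including the bin instead would pull in customers aged 35 to 39. The single male customer in the target group is suppressed rather than reported.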
Building a Privacy-preserving analytics stack – better understand how to comply with the requirements imposed by GDPR while still leveraging data analysis.
GDPR is one of the most far-reaching data protection laws anywhere in the world, and as such it has had a huge impact globally. This is because, unlike many national data protection laws, GDPR applies to any company that deals with EU residents, wherever they are in the world. In this section we will look at the specific impact GDPR has had on analytics.
In this section we give you some advice on how to select tools for your stack. As with most things, the right tool for one setting won’t be right in another setting. So, this advice covers the things you should consider when you select a particular tool.