The European Union’s General Data Protection Regulation (GDPR) states that anonymous data is not personal data, and therefore does not fall under the purview of GDPR. The GDPR, in Recital 26, defines anonymous as:
“…information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.”
The GDPR also provides for certification programs, including for certifying anonymity. However, to date no organization has set up such a program for anonymity, nor, to my knowledge, does any have a clear plan for how to do so.
This creates a problem for the providers and users of anonymization technologies. How can an anonymization provider legitimately claim that its technology meets the anonymity standard for GDPR? How can a data controller be confident that the data it releases is anonymous enough? (In the remainder of this article, the term anonymous is understood to mean anonymous by GDPR standards.)
Aircloak builds and markets an anonymization product based on Diffix, an anonymization algorithm jointly developed by Aircloak and the Max Planck Institute for Software Systems (MPI-SWS). Since the beginning of the Aircloak/MPI-SWS collaboration, we have been struggling with these questions. Early on, we obtained a certification from TÜViT, a private company that certifies data protection products and services. We also worked with CNIL, the French DPA. In both cases we were successful insofar as we received essentially the strongest endorsement that each organization offers.
The experience, however, convinced us that no certification organization or DPA is really in a position to assert with high confidence that Diffix, or for that matter any complex anonymization technology, is anonymous. These organizations either don’t have the expertise, or they don’t have the time and resources to devote to the problem.
A couple of years ago, we realized that we are really on our own with respect to this problem. It is up to us to find ways to build assurance, both for ourselves and our users, that Diffix complies with GDPR anonymity. This article describes our approach.
The four pillars of GDPR compliance assurance
There are four key aspects to GDPR compliance assurance.
1. Use strong criteria for anonymity
2. Establish and maintain a “no known attacks” stance
3. Full transparency
4. Encourage active public oversight
It is important to point out that a future DPA certification program is not a replacement for these pillars. A certification program is limited in what it can do. It is likely to consist of a set of tests that the anonymization technology needs to pass. The set of tests is unlikely to be complete, and so these pillars must still be in place to make up for limitations in certification programs, and indeed to help improve certification programs.
In evaluating Diffix, we use the anonymization criteria set forth in the Article 29 Data Protection Working Party Opinion 05/2014 on Anonymisation Techniques. The Article 29 opinion describes three core criteria for anonymization, and gives examples of how a variety of anonymization mechanisms satisfy or fail to satisfy these criteria.
Although the opinion is now five years old, the set of anonymization mechanisms it discusses is still largely up to date. The only substantial new technology to appear in these five years is Diffix. While it would be helpful for an updated opinion to include Diffix, we have substantial documentation on how Diffix measures up against the three criteria, and so an update, at least in this regard, is not strictly necessary.
The three criteria are:
- Singling-out: which corresponds to the possibility to isolate some or all records which identify an individual in the dataset
- Linkability: which is the ability to link, at least, two records concerning the same data subject or a group of data subjects (either in the same database or in two different databases)
- Inference: which is the possibility to deduce, with significant probability, the value of an attribute from the values of a set of other attributes
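To make the inference criterion more concrete, here is a minimal Python sketch. It is only an illustration, not the GDA score and not part of Diffix: the function name, the `threshold` and `min_group_size` parameters, and the sample records are all invented for this example. The sketch flags combinations of known attribute values for which a single value of a target attribute dominates, which is roughly what "deduce, with significant probability" means in practice.

```python
from collections import Counter, defaultdict


def inference_candidates(records, known_attrs, target_attr,
                         threshold=0.9, min_group_size=2):
    """Find known-attribute combinations that predict target_attr.

    Returns a dict mapping each combination of known values to a tuple
    (most common target value, share of the group holding that value),
    keeping only groups whose share is at least `threshold`.
    Both parameters are illustrative assumptions, not standard values.
    """
    groups = defaultdict(Counter)
    for rec in records:
        key = tuple(rec[a] for a in known_attrs)
        groups[key][rec[target_attr]] += 1

    findings = {}
    for key, counts in groups.items():
        size = sum(counts.values())
        if size < min_group_size:
            # Groups of one are a singling-out problem, not inference.
            continue
        value, hits = counts.most_common(1)[0]
        share = hits / size
        if share >= threshold:
            findings[key] = (value, share)
    return findings


# Hypothetical records, invented for illustration only.
records = [
    {"zip": "10115", "age_band": "30-39", "diagnosis": "flu"},
    {"zip": "10115", "age_band": "30-39", "diagnosis": "flu"},
    {"zip": "10115", "age_band": "30-39", "diagnosis": "flu"},
    {"zip": "10117", "age_band": "40-49", "diagnosis": "flu"},
    {"zip": "10117", "age_band": "40-49", "diagnosis": "asthma"},
]

findings = inference_candidates(records, ["zip", "age_band"], "diagnosis")
print(findings)
```

In this toy data, anyone who knows that a person is in the first zip code and age band can infer the diagnosis without ever isolating that person's record. The second group is split evenly and so is not flagged.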
It is critical to note what is not included in these criteria. It is not necessary to establish the Personally Identifying Information (PII) of the individual to violate these criteria. Examples of PII are names and addresses.
To give an example, suppose there is a geo-location dataset that contains records of times and locations (longitude and latitude) of individuals. Suppose that there is a single record containing a given time and location: no other record has the same time and location. Merely identifying that there is a single such record violates the singling-out criterion. It is not necessary to name the person to whom this record refers.
This may seem counter-intuitive. After all, if you don’t know who this record refers to, then how can it be that that person’s anonymity is broken? The problem is that it may well be that someone somewhere can name the person given this location information. Perhaps they saw that person there at that time and recognize who it is. The Article 29 criteria cannot presume what the person viewing the records knows or does not know. Rather, the criteria are based only on the data itself.
This feature makes the criteria quite strong.
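The geo-location example above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the test that Diffix or the GDA score uses: the function name and the sample records are hypothetical. The point is that the check looks only at the data itself and never needs a name.

```python
from collections import Counter


def singled_out(records, attrs):
    """Return the records whose combination of `attrs` values is unique."""
    keys = [tuple(rec[a] for a in attrs) for rec in records]
    counts = Counter(keys)
    return [rec for rec, key in zip(records, keys) if counts[key] == 1]


# Hypothetical geo-location records, invented for illustration only.
geo_records = [
    {"time": "2019-01-07T08:15", "lat": 49.4401, "lon": 7.7491},
    {"time": "2019-01-07T08:15", "lat": 49.4401, "lon": 7.7491},
    {"time": "2019-01-07T23:40", "lat": 49.2402, "lon": 6.9969},
]

unique = singled_out(geo_records, ["time", "lat", "lon"])
print(unique)
```

The third record is the only one at its time and location, so it is singled out even though the dataset contains no names or addresses.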
The criteria in the Article 29 opinion are stated informally. MPI-SWS has designed a concrete numeric measure for each of the criteria. The measure, called the GDA (General Data Anonymity) attack score, is based on the results of actual attacks on implementations of the anonymity technique.
“No known attacks” stance
In conversations with privacy colleagues, I often hear the following opinion:
“There is no such thing as zero risk. Therefore all evaluations of anonymization must be risk-based.”
Of course it is true that there is strictly speaking no such thing as zero risk. But a risk-based evaluation is generally needed when attacks on a system are known. For instance, a risk-based evaluation on an anonymization method might conclude that there is a 1% re-identification risk due to a known attack, and therefore the anonymized data should only be shared with trusted partners under a contractual obligation to not attempt that re-identification.
In thinking about GDPR compliance assurance, we find it very useful to categorize anonymization technologies into the following:
1. No possible attack (logical or mathematical proof).
2. No known attacks.
3. Known attacks.
An example of the first category is Differential Privacy (in fact the only example of which I am aware). If a Differentially Private mechanism is operated with a low epsilon (say less than 1.0) and has a validated proof and no strong assumptions, then it is reasonable to say that there is no possible attack regardless of what the attacker may know or how much computational power the attacker has. In this case, there is no need to consider risk: other than the always-present risk of incorrect implementations, faulty configuration (too high epsilon), or incorrect mathematical proofs, there is nothing an attacker can do to break anonymity.
Unfortunately, Differential Privacy rarely gives adequate analytic utility, and so is rarely usable in practice.
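To give a feel for the trade-off between epsilon and utility, here is a minimal Python sketch of the textbook Laplace mechanism applied to a single counting query. It is an illustration only and has nothing to do with how Diffix works. The function names are made up, the sensitivity of 1 assumes that each individual changes the count by at most one, and a real deployment would also have to account for the privacy budget consumed across repeated queries, which this sketch ignores.

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace distribution centred on zero.

    The difference of two independent exponential samples with mean
    `scale` follows a Laplace(0, scale) distribution.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noise_std(epsilon, sensitivity=1.0):
    """Standard deviation of the noise added for a given epsilon."""
    return math.sqrt(2.0) * sensitivity / epsilon


def dp_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return a noisy count using the Laplace mechanism."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = sensitivity / epsilon
    return true_count + laplace_noise(scale, rng)


rng = random.Random(0)  # seeded only so the demo is repeatable
for eps in (0.1, 1.0, 10.0):
    noisy = dp_count(100, eps, rng)
    print(f"epsilon={eps}: noisy count {noisy:.1f}, noise std {noise_std(eps):.2f}")
```

The noise scale is inversely proportional to epsilon, so the low values that give strong guarantees also give the noisiest answers, and the noise accumulates further once an analyst asks more than one question.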
Far more common are de-identification technologies for which there are known attacks. Pseudonymization, whereby PII is removed from the data, is a case in point. When there are known attacks, then a risk-based approach must be taken to mitigate the possibility that these attacks will occur in practice. To be clear, a risk-based approach can be very effective. Careful partnering, legal restrictions, and other safeguards can act as a genuine deterrent and render the use of the data quite safe. Nevertheless, such data cannot be regarded as anonymous, and should remain subject to the GDPR.
With Diffix we strive for the middle ground, no known attacks. We argue that, if there are no known attacks, and a strong effort has been made to find attacks, then it is not necessary to do a risk-based evaluation. Indeed, it is hard to see how one could do a risk-based evaluation since if there are no known attacks, then there are also no known risks.
Although a no-known-attacks stance is new within data anonymization, it is the norm in almost all other aspects of privacy and security. To take an example, the Advanced Encryption Standard (AES), which is the technical basis for much if not most encryption of data in transit, operates on a no-known-attacks basis. There is no proof that AES is secure. Rather, there is a long history of attacks and defenses leading to the current technology.
There are two ways to get to the point of no-known-attacks. One is willful ignorance: to simply not look for attacks, and not give others a chance to look for attacks by keeping the technology secret. In other words, security by obscurity. Obviously we are not interested in this approach.
The other is to make a serious concerted effort to find attacks, and to be as open as possible about the technology so that others may find attacks.
From the beginning of the Aircloak/MPI-SWS collaboration, we have worked hard to find weaknesses in our systems. One could roughly characterize the division of labor as MPI-SWS constantly looking for attacks, and Aircloak constantly working to ensure that the technology is usable by customers. (Though Aircloak certainly also looks for, and has found, a number of attacks.) In fact, Diffix is the third generation of anonymization technology developed by the Aircloak/MPI-SWS collaboration, the first two having been found not strong enough.
Beyond this, Diffix is fully transparent. We publish descriptions of Diffix openly on ArXiv, and update the descriptions in a timely manner.
Encourage active public oversight
Just because a technology is open does not necessarily mean that anybody will pay attention to it. In fact there is a chicken-and-egg problem here. A new security technology is unlikely to receive much attention until it is widely deployed, but once widely deployed, the flaws are more serious. Of course it is also possible to get wide deployment while still staying under the radar of the public simply by being quiet about the number of deployments.
In order to encourage active public oversight, we run a bounty program for anonymity. We are the first (and still only) organization to do so. We ran the first round of the program from Dec. 2017 to May 2018. Aircloak put up a total purse of $15,000. Thirty teams or individuals signed up to attack the system. Two teams were successful and received $10,000 in total. Fixes were implemented and deployed during 2018, after which the details of the attacks were published. We plan to run the second round in March of 2019.
In addition, one academic group published an attack on Diffix in 2018, though the attack was only theoretical in that it was never executed. In practice the attack appears to be ineffectual.
During this same period, MPI-SWS also discovered a new attack. The fix for this attack is in place, and the attack will be published in early 2019.
The conclusion from this is that the bounty program is working. It attracted attackers that otherwise would not have given Diffix any attention. Flaws were discovered and fixed, and Diffix is more secure as a result.
During 2018 there were periods of time when there were known attacks—at least known privately. As such, it was appropriate to do a risk-based analysis of our customer base at the time. In this particular case, risk was very low, in part because the attacks required substantial effort to obtain only limited information, and in part because our customers had given access only to trusted partners.
Moving forward, it will be the case that there are periods where there are no known attacks, periods where there are privately known attacks, and possibly periods where there are publicly known attacks. This is the case with virtually any security technology. We believe that, so long as the normal case is that there are no publicly known attacks, and privately known attacks are fixed quickly, Diffix can be considered anonymous.
EU member body DPA certification programs for anonymization are not in place, and we don’t expect to see effective certification programs for the foreseeable future. It is therefore incumbent upon providers of anonymization technology to actively work towards GDPR compliance assurance.
For Aircloak and MPI-SWS, the four pillars of GDPR compliance assurance are:
1. Strong criteria
2. No known attacks stance
3. Full transparency
4. Encourage public oversight
We encourage Data Protection Officers to only work with anonymization technology providers that adhere to the four pillars. To assist other providers in doing so, MPI-SWS has started the Open GDA Score Project. This project provides tools for measuring anonymity according to the three Article 29 criteria, and serves as a repository for the resulting scores. It will also provide guidance on how to run bounty programs on anonymity.