On April 18, researchers from Imperial College London and UCLouvain published a vulnerability in Diffix. Diffix is the anonymization algorithm jointly designed by the Max Planck Institute for Software Systems (MPI-SWS, www.mpi-sws.org) and Aircloak GmbH (www.aircloak.com), which markets an anonymization product based on Diffix. The vulnerability was posted on the academic paper archive arXiv here.
A couple of hours after the publication, one of the authors (Yves-Alexandre de Montjoye) informed Aircloak of the vulnerability.
On April 24, de Montjoye tweeted about the vulnerability, and the following day Imperial College London retweeted it. The tweet referred to a blog post about the vulnerability. Both the paper and the blog characterize the vulnerability as serious.
From the paper:
“We believe this vulnerability to be serious, allowing us to accurately infer private information of users with little background knowledge.”
From the blog:
“We developed a powerful attack allowing an attacker to potentially infer a user’s attribute with high accuracy with very limited auxiliary information.”
Given these and other statements, Aircloak's customers and others interested in Aircloak may justifiably be concerned about the strength of Diffix's anonymity. In fact, the researchers never implemented or tested the attack, so in spite of the alarming language they had no definitive evidence that the attack is actually effective.
We (MPI-SWS and Aircloak) have examined the attack, and have found the vulnerability to be minor. The attack can in principle work (and for this the researchers get credit for finding a new attack) but it requires a set of conditions that very infrequently exist in practice. Furthermore, a Diffix improvement that was already in development before the vulnerability was reported addresses the attack fully.
Now we briefly describe the basic idea of the attack through an example (taken from the paper). Suppose that an attacker knows that a user (the victim) in a database is uniquely identified by zip=48828 and age=37. In other words, nobody else in the database has this zip and age. Suppose further that the attacker wants to learn whether the victim has been diagnosed as HIV positive (HIV=1). The attacker then makes the following pairs of queries:
- Q1.1: SELECT count(*) FROM table WHERE HIV=1 AND zip=48828
- Q1.2: SELECT count(*) FROM table WHERE HIV=1 AND zip=48828 AND age != 37
- Q2.1: SELECT count(*) FROM table WHERE HIV=1 AND age=37
- Q2.2: SELECT count(*) FROM table WHERE HIV=1 AND age=37 AND zip != 48828
followed by two more pairs (Q3.1, Q3.2, and Q4.1, Q4.2) that are identical to these except with HIV=0 instead of HIV=1. The difference in the counts for each pair is either 0 or 1, depending on whether the victim is HIV positive or not. Each count has a little Gaussian noise added. With only a single pair, the attacker’s confidence that the true answer is 0 or 1 improves only a little over a statistical guess. With 4 such pairs (i.e. 4 random Gaussian samples), however, the attacker’s confidence improves substantially.
The above example has three attributes (the unknown attribute HIV, and the known attributes age and zip). More attributes would increase the number of random samples and therefore the confidence even more. The paper estimates that 5 attributes would give the attacker better than 99% confidence.
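To make the confidence gain concrete, here is a small Monte Carlo sketch of the core statistical step. It is an illustration under simplified assumptions we chose ourselves (independent zero-mean Gaussian noise with standard deviation 1 on every count, and a simple threshold decision); Diffix's actual noise layers differ, so the exact accuracies should not be read as a reproduction of the paper's estimates.

```python
import random
import statistics

def simulate_attack(num_pairs, noise_sd=1.0, trials=10_000, seed=0):
    """Estimate how often the attacker guesses the victim's bit correctly.

    Each query pair yields the victim's secret bit d plus the difference
    of two independent Gaussian noise values, i.e. d + N(0, 2*noise_sd^2).
    The attacker averages the pairs and guesses d=1 if the average
    exceeds 0.5.  (Simplified noise model, for illustration only.)
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        d = rng.randint(0, 1)  # victim's secret bit (e.g. HIV status)
        samples = [d + rng.gauss(0, noise_sd) - rng.gauss(0, noise_sd)
                   for _ in range(num_pairs)]
        guess = 1 if statistics.mean(samples) > 0.5 else 0
        correct += (guess == d)
    return correct / trials

for k in (1, 2, 4, 8):
    print(k, "pairs ->", simulate_attack(k))
```

The accuracy climbs steadily as the number of pairs grows, which is exactly the effect the attack exploits: each additional attribute supplies another independent noisy sample of the same 0-or-1 difference.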
For the attack to succeed, the following conditions have to hold:
1. The set of known attributes uniquely identifies the user,
2. Negating each individual attribute (while keeping the rest) still matches a sufficiently large number of users (where sufficiently large is 6 or more), and
3. The attacker knows with high confidence that conditions 1 and 2 hold.
For the attacker to have high confidence, the attacker needs to know all or nearly all of the values in the columns containing the known attributes. This could well happen, for instance, if a data set is divided into public and private parts. Nevertheless, we want to be clear that it is not enough to know some attributes about one individual: the attacker must also know those same attributes about most of the people in the database.
The second condition is needed because answers that involve fewer than 6 users have a chance of being suppressed, and that chance shrinks as the number of users grows (we call this low-count suppression). It is OK if not all of the attributes meet condition 2, but a sufficient number of them must.
So the important question is: how often do conditions 1 and 2 hold in real databases? To get a sense of this, we searched for these conditions in the set of real databases used for the Aircloak bounty program. These databases vary widely in their characteristics, ranging from census data with tens of attributes and one row per user to banking transaction data with tens of rows per user.
Our methodology is to explore the database by randomly selecting attributes and attribute values, checking whether a single user is isolated by the set of attributes (first condition), and if so, checking how many random samples can be produced by the subsequent attribute negations (second condition). We ignored attributes where most values uniquely identify a user (like taxi start time). The procedure proceeds from smaller to larger numbers of attributes. Unlike the example above, where the unknown attribute had only two possible values, in our exploration the unknown attribute can have more values. Our exploration considers only query pairs that match the target unknown value.
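The gist of this search can be sketched as follows. This is not our actual measurement code (which is linked below); it is a minimal, self-contained illustration on a synthetic one-row-per-user table, with a helper we named usable_negations that checks conditions 1 and 2 directly.

```python
import random
from itertools import combinations

def matches(row, conds):
    """True if the row satisfies every (column, value) condition."""
    return all(row[col] == val for col, val in conds)

def usable_negations(rows, conds, min_group=6):
    """Count how many negated queries would escape low-count suppression.

    Returns 0 unless the conjunction of conditions isolates exactly one
    user (condition 1); otherwise returns the number of attributes whose
    negation (keeping the rest) still matches at least min_group users
    (condition 2).
    """
    if sum(matches(r, conds) for r in rows) != 1:
        return 0
    usable = 0
    for i, (col, val) in enumerate(conds):
        rest = conds[:i] + conds[i + 1:]
        group = sum(1 for r in rows
                    if matches(r, rest) and r[col] != val)
        if group >= min_group:
            usable += 1
    return usable

# Synthetic table, purely for illustration.
random.seed(0)
rows = [{"zip": random.randrange(5), "age": random.randrange(5),
         "job": random.randrange(5)} for _ in range(200)]

# Explore from smaller to larger attribute sets, as in the text.
best = 0
for r in rows:
    for k in (2, 3):
        for combo in combinations(("zip", "age", "job"), k):
            conds = [(c, r[c]) for c in combo]
            best = max(best, usable_negations(rows, conds))
print("most usable negations found for any combination:", best)
```

The more usable negations a combination yields, the more independent noisy samples an attacker gets, and hence the higher the attacker's confidence.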
This methodology gives a good approximate answer to the following question: "If an attacker wants to learn a specific thing about a specific user in the database, what are the chances that the conditions exist to launch the attack?" For more details on how our measurement works, the measurement code can be found here.
The key results of the measurement are as follows:
- Of approximately 150,000,000 measured attribute combinations, 2 provided the attacker with 4 attackable attributes, 143 provided 3 attributes, 1076 provided 2 attributes, 1474 provided 1 attribute, and the remainder provided none at all. Assuming that 4 attributes would give the attacker relatively high confidence as to the value of the unknown attribute, only 1 in roughly 75 million known attribute combinations would be attackable.
- The best case (for the attacker) is the census database, which has 112 attributes, most of them numeric, and many with a relatively small number of possible values. Here we found 67 combinations that provided 3 attributes out of 337K tested combinations. (None provided 4 or more.)
- The worst case is the scihub database, which is a log of downloads from the scihub academic paper repository and has 8 columns, most of which have a large number of distinct values. Here, of 69 million tested combinations, 4 provided 2 attributes. (None provided 3 or more.)
These results show that the chances of an attacker being able to learn an attribute about a given victim with high confidence are extremely low. For the case where an attacker doesn't care who he or she learns about, but rather simply wants to collect as much unknown information as possible, the results also suggest that, even given full knowledge of several database columns, an attacker would not be able to learn about very many users.
Why are attackable combinations so rare? We can perhaps shed light on this question by considering a couple of extreme cases.
First, suppose that all attributes have only two values, each with 50% probability, and that there are a million users in the database. In this case, typically 20 attributes are required to uniquely define a user. (Each attribute cuts the population in half, and 20 such cuts reduce the population to 1 in expectation.) While 20 attributes would give the attacker a lot of samples, the problem is that negating one attribute still defines roughly one user. To see this, imagine that the attribute is gender. If one male is defined by 19 attributes plus gender=male, then roughly one female is defined by the same 19 attributes plus gender!=male. Low-count suppression would cause the attack to fail.
Now, take the other extreme. Suppose that, for all attributes, every attribute value uniquely defines a single user. At this extreme, the attack can’t be formulated because the query would have only one attribute.
Now consider a middle ground: a database with 6.25 million users, where each attribute has 50 values uniformly distributed among the users. In this case, 4 attributes typically uniquely define a user, and 3 attributes define a group of 50 users on average. An attack query with three positive attributes (two for the known attributes, one for the unknown) and one negated attribute would define a group of 49 users on average and would therefore avoid low-count suppression. Under these conditions the attack works.
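The arithmetic behind these examples can be checked directly (the user counts are the illustrative ones from the text):

```python
# Extreme case: one million users, binary 50/50 attributes.  Each
# attribute halves the population, so about 20 attributes are needed
# to isolate a single user:
print(1_000_000 / 2**20)         # ~0.95 users per 20-attribute cell
# Negating one of the 20 attributes leaves 19 halvings, and the
# complementary value again halves that group, so the negated query
# also matches ~1 user and is low-count suppressed.

# Middle ground: 6.25 million users, each attribute uniform over 50 values.
users, per_value = 6_250_000, 50
print(users / per_value**4)      # 1.0 -> four attributes isolate one user
print(users / per_value**3)      # 50.0 -> three attributes leave ~50 users
print(users / per_value**3 - 1)  # 49.0 -> negated query keeps ~49, above 6
```

The middle ground works precisely because removing one attribute from the identifying set inflates the group well past the suppression threshold, while the full set still isolates the victim.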
Our measurements show, however, that this middle ground apparently very rarely exists in practice, at least not for the real and varied set of databases we measured.
As mentioned above, Aircloak was in any event already implementing, for other reasons, a new anonymization mechanism that invalidates this attack under any circumstances. Given this fact, combined with the apparently low risk of the attack, we do not feel a need to explore this attack further.
Nevertheless, we encourage the Imperial College London and UCLouvain team to explore the attack further if they believe that our measurements or analysis are incorrect. If they do so, we encourage them to make use of Aircloak's bounty program and demonstrate the attack on a real system.