Diffix Vulnerability #2
Discovered |
April 2018 |
Demonstrated |
—– |
Announced |
April 2018 |
Severity |
Very Low |
Expected Patch Date |
—– |
This attack was discovered by Andrea Gadotti, Florimond Houssiau, Luc Rocher, and Yves-Alexandre de Montjoye of Imperial College London. The attack is documented at https://arxiv.org/abs/1804.06752. The attack has never been demonstrated to work in practice. The attack has been implemented by researchers at the Max Planck Institute for Software Systems (MPI-SWS), but has so far failed to work.
Update [August 2019]: The authors disclosed the attack again in August 2019 in https://www.usenix.org/system/files/sec19-gadotti_0.pdf. The authors were able to make a synthetic dataset custom designed so that a given column happens to be attackable for every user. They demonstrated 92% accuracy on a column where a random guess would yield 50% accuracy. The attack was not effective against any of the four real datasets that the authors studied.
Goal:
The attack is a singling out attack, whereby the attacker makes a claim of the sort “There is a single user with the following attributes.”
Prerequisites:
The attacker must know the complete contents of roughly N=5 columns. The attacker is trying to learn an unknown value in an N+1th column. Within the N known columns, there must be a set of values whereby all N values uniquely identify a user, but where any N-1 values in combination with the unknown value identify enough users to avoid Low Count Filtering in the Aircloak system. (The conditions are rare. In measurements carried out by MPI-SWS, roughly 1 in 75 million rows satisfied the conditions.)
Attack:
The attack generates a small set of queries whereby each answer has a different set of users (those defined by any N-1 values), and therefore a different noise sample. One set of queries does not contain the victim user, and the other set of queries either contains or does not contain the victim user, depending on whether the victim has or does not have the unknown column value. Because each query has a different noise value, the attack can take the average of each set to improve the confidence in the exact value. By taking the difference in counts between the set without the victim, and the set that may or may not contain the victim, the attacker can deduce whether the victim is in the second set, and therefore whether the victim has the unknown value or not.
Effectiveness:
Both Aircloak and the authors have found the attack to be ineffective against real datasets. The authors of the attack demonstrated it only against custom-designed dataset which does not exist in practice with very high probability.
Patch:
Future planned improvements in anonymization will make the attack impossible. These improvements will detect when any given condition pertains to only a few users or less, and ignore the condition.