Diffix Vulnerability #3



May 2018


May 2018


October 2018



Patched Version


Patched Date

July 2018

This attack was discovered by Apostolos Pyrgelis and Emiliano De Cristofaro of UCL, and Carmela Troncoso of EPFL as part of the first Aircloak Challenge (Dec 2017 – May 2018). The attack is described in https://www.benthamsgaze.org/2018/10/02/on-location-time-and-membership-studying-how-aggregate-location-data-can-harm-users-privacy/. The attack was successfully demonstrated on the Aircloak challenge system.

The attack is a linkability attack, whereby the attacker makes a claim of the sort “This user in a known dataset is also in an unknown dataset.” The demonstrated attack used the time (pickup_datetime, dropoff_datetime) and location (pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude) columns of the taxi database. The attacker has full knowledge of a “known” dataset that contains a substantial proportion of rows that may also be found in an “unknown” dataset. In the demonstrated attack, half of the rows in the known dataset were also in the unknown dataset.

The demonstrated attack required a column that uniquely identifies users in both the known and unknown tables, with the same ID value for any given user in both tables. It also required full knowledge of the users being attacked. (The goal is not to learn anything new about the users per se, only whether they are in the unknown dataset or not.)

The attack used machine learning techniques. Training takes place on the known dataset, but with that dataset deployed behind an Aircloak system. The trained model is then used to predict the presence or absence of a user in the unknown dataset. Training consisted of roughly 20,000 queries spread over a location grid and time units, and the attack itself required roughly 10,000 queries. The attack focused on the 100 users (taxi drivers) with the most rides.

The queries themselves used the SQL IN clause to select the users being attacked:

… WHERE uid IN (user1, user2, user3, …, userN) …
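As a rough illustration of how such queries could be spread over a location grid and time units, the sketch below generates one query string per grid cell and time window. Only the IN condition and the column names come from the description above. The table name `rides`, the `count(*)` aggregate, the use of pickup columns only, integer user IDs, and the helper names `in_condition` and `grid_queries` are all assumptions made for illustration; the full query used in the attack is not shown here.

```python
from itertools import product


def in_condition(user_ids):
    """Build the `uid IN (...)` condition for the targeted users.

    Assumes integer user IDs; int() also keeps arbitrary text out of the SQL.
    """
    return "uid IN (" + ", ".join(str(int(u)) for u in user_ids) + ")"


def grid_queries(user_ids, lon_edges, lat_edges, time_windows, table="rides"):
    """Yield one hypothetical counting query per (grid cell, time window).

    lon_edges / lat_edges are sorted cell boundaries; time_windows is a list
    of (start, end) timestamp strings. The query shape is an assumption.
    """
    cond = in_condition(user_ids)
    lon_cells = list(zip(lon_edges, lon_edges[1:]))
    lat_cells = list(zip(lat_edges, lat_edges[1:]))
    for (lon_lo, lon_hi), (lat_lo, lat_hi), (t_lo, t_hi) in product(
        lon_cells, lat_cells, time_windows
    ):
        yield (
            f"SELECT count(*) FROM {table} "
            f"WHERE {cond} "
            f"AND pickup_longitude >= {lon_lo} AND pickup_longitude < {lon_hi} "
            f"AND pickup_latitude >= {lat_lo} AND pickup_latitude < {lat_hi} "
            f"AND pickup_datetime >= '{t_lo}' AND pickup_datetime < '{t_hi}'"
        )
```

The number of generated queries is the product of the number of longitude cells, latitude cells, and time windows, which is how a modest grid can quickly reach query counts in the thousands.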

When attacking the 100 users with the most rides, 50 of 51 claims were correct.

The attack generally works against datasets where users have a substantial number of entries (rows) in both the known and unknown datasets. This is required to provide enough data for the training and attack phases. The fewer the rows, the less effective the attack. When attacking 100 random users rather than the 100 users with the most rides, 32 of 50 claims were correct.

The fix deployed by Aircloak has two parts. The first part is to determine which columns could be used in such an attack. These are columns where a substantial proportion of values (at least 80%) each uniquely identify a user, and they are called “isolating”. The second part is to limit the number of values from an isolating column that can appear in an IN condition.
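The two parts of the fix can be sketched roughly as follows. This is not Aircloak's implementation: it assumes that “proportion of values” means the fraction of distinct column values that belong to exactly one user, it works on an in-memory list of row dictionaries, and the function names and the `max_values` parameter are hypothetical. The 80% threshold is the only figure taken from the description above; the actual limit on IN values is not stated, so the caller must supply one.

```python
from collections import defaultdict

ISOLATING_THRESHOLD = 0.8  # "at least 80% of values", per the description


def isolating_fraction(rows, column, uid_column="uid"):
    """Fraction of distinct values in `column` that belong to exactly one user."""
    users_per_value = defaultdict(set)
    for row in rows:
        users_per_value[row[column]].add(row[uid_column])
    if not users_per_value:
        return 0.0
    single_user = sum(1 for users in users_per_value.values() if len(users) == 1)
    return single_user / len(users_per_value)


def is_isolating(rows, column, uid_column="uid"):
    """Part one: flag columns whose values mostly identify a single user."""
    return isolating_fraction(rows, column, uid_column) >= ISOLATING_THRESHOLD


def in_condition_allowed(rows, column, in_values, max_values, uid_column="uid"):
    """Part two: cap the number of IN values for isolating columns.

    `max_values` is a hypothetical parameter; the real limit is not given.
    """
    if is_isolating(rows, column, uid_column):
        return len(in_values) <= max_values
    return True
```

Under these assumptions, a user ID column is always isolating, since every value maps to one user, so a long `uid IN (…)` list like the one used in the attack would be rejected, while an IN condition on a widely shared value such as a city name would pass.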