Diffix Cedar bounty prize awarded to UPenn students



August 25, 2020 by Paul Francis

I would like to congratulate Matthew Joseph, Zachary Schutzman, and Travis Dick, of the University of Pennsylvania, for their brilliant de-anonymization attack on Diffix Cedar as part of the MPI-SWS Diffix bounty program. For their effort, the UPenn team received $5150 out of a possible $10000 payout.

The attack was demonstrated in May 2020 on Diffix Cedar. The fix was shipped in release 20.2 (Diffix Dogwood) on August 13.

The Diffix Cedar bounty consists of two parts, each worth $5000. The effectiveness part measures whether anything at all can be attacked, whereas the coverage part measures how much can be attacked. The UPenn team received the full $5000 for the effectiveness part, and $150 for the coverage part.

Aircloak assigns a severity level to publicly announced attacks. The purpose of the severity level is to allow Aircloak customers to prioritize their own risk assessments of the vulnerability. The Aircloak severity level is a conservative value which assumes that an attacker can re-identify individuals and knows of the attack (see this article for an explanation of bounty payouts and severity levels and how they relate to any given customer’s actual risk).

This attack was assigned a severity level of ‘moderate’, the middle level on a 5-point scale from ‘very low’ to ‘very high’.

The real risk of this attack being exploited against any of Aircloak’s customers during the period between discovery and fix was negligible.

The technical details of the attack are described in the appendix. 

The UPenn attack scores

The Diffix bounty program uses the GDA Score as the basis for determining the payout. The GDA Score defines three types of attacks. Of these three, the UPenn attack used the singling-out type, where the attacker makes a claim of the form “There is a single individual with the following attributes.” The UPenn team attacked the taxi database, which is one of the public databases used for the Diffix bounty program. The individual in this database is the taxi driver (not the passenger). The taxi database contains New York City taxi trips and can be viewed at https://db001.gda-score.org/, table “raw_taxi”.

Effectiveness:

The most effective single attack that the UPenn team produced achieved a 100% confidence improvement (all claims correct). See https://www.gda-score.org/mpi-sws-diffix-challenge-2020/ for more information about the confidence improvement measure.

Coverage:

The attack required three columns, two known and one unknown. If we consider all possible combinations of three columns from the 29 columns of the taxi database, then (in rough numbers):

  • 3% of the combinations produced a 95% or better confidence improvement
  • 4% of the combinations produced a 50% confidence improvement
  • 2% of the combinations produced a 15% confidence improvement
  • 15% of the combinations produced a 6% confidence improvement
  • The remaining 76% of combinations were not effective or could not be run

Fix

The UPenn attack is based on the Linear Reconstruction attack (LR) first published by Dinur and Nissim in 2003. (See the appendix for the technical details of the attack.) The LR attack requires that the attacker has the ability to select many different random or well-mixed groups of users.

To do this, the UPenn attack requires the use of a complex mathematical expression that combines some kind of “bucketing” function (floor(), ceil(), round(), mask to int, bucket(), etc.) with some arithmetic computations. The fix implemented by Aircloak is to disallow any arithmetic computations on columns where a bucketing function is used.
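
By way of illustration, the following toy Python check (a sketch of the stated policy, not Aircloak’s actual query validation) flags the kind of condition that the fix rejects: anything that combines a bucketing function with further arithmetic. The real rule is applied per column; this sketch simply scans the whole condition.

import re

# Toy illustration of the policy described above, not Aircloak's implementation:
# flag a condition that combines a bucketing function with additional arithmetic.
# (The real rule is per-column; this sketch just scans the whole condition.)
BUCKETING_FUNCTIONS = ("floor(", "ceil(", "round(", "bucket(")
ARITHMETIC = re.compile(r"[-+*/^]|\bpow\s*\(")

def violates_policy(condition: str) -> bool:
    cond = condition.lower()
    uses_bucketing = any(f in cond for f in BUCKETING_FUNCTIONS)
    uses_arithmetic = ARITHMETIC.search(cond) is not None
    return uses_bucketing and uses_arithmetic

# Rejected: bucketing (floor) combined with arithmetic (^ and +).
print(violates_policy("floor(pickup_latitude ^ 8.789 + 0.5) = floor(pickup_latitude ^ 8.789)"))
# Allowed: plain bucketing with no arithmetic.
print(violates_policy("floor(pickup_latitude) = 40"))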

No analytic tasks in use by Aircloak customers required the combination of bucketing functions and arithmetic expressions, so the fix has little impact on utility.

Severity

Aircloak assigns a severity score to vulnerabilities so that customers can prioritize their own risk assessments. Aircloak’s score is based on four primary risk factors. Each factor is assigned a risk level of low, medium, or high.

Required prior knowledge: An attack may require specific knowledge, either external to the database or from the database itself. Factors include how much prior knowledge is needed as well as the generality of the prior knowledge (whether only certain types of data can be used).

The UPenn attack required prior knowledge of part or all of two columns. One of the two columns is used to select a relatively small number of distinct users (around 500). For all rows of data so selected, the attacker must know all or nearly all of the values of the second column. For example, if there is a column identifying users as being in the HR department of a company, this column could be used to select the subset of users, and the attacker would need to know all values of a second column.

Given the amount and specific type of prior knowledge needed, the prior knowledge risk factor is low.

Detection exposure: Queries in the Aircloak system can be logged. If the attack leaves a clear signature, then this serves as a deterrent.

Each individual UPenn attack requires between roughly 500 and 20000 queries; learning values from columns with more distinct values requires more queries. The queries themselves have a distinctive signature, so the attacker’s exposure to detection is high. Each demonstrated attack learned values for around 500 individuals, but all of the queries are required even if the attacker is interested in only a single individual.

Given the large number of queries and distinctive signature, the detection exposure risk factor is low.

Data conditions: Specific conditions in the data itself may be required for the attack to work.

For the UPenn attack, the second of the prior knowledge columns described above has specific requirements. First, the column must be numeric. Second, somewhere between roughly 50% and 80% of the values in the column must be unique to a single individual. Above 80%, the attack was prevented, and as the percentage goes down, the quality of the attack degrades. In particular, an identifying column cannot be used. This eliminates the use of columns like account number that might be public knowledge.

Given these requirements, the data conditions risk factor is medium.

Generality of learned information: An attack may be able to learn certain types of data but not others.

In the UPenn attack, any column with relatively few distinct values, as well as any numeric or datetime column, could be learned. Learning text columns with a large number of distinct values was not demonstrated, but the UPenn team believes that they are learnable as well.

Given this, the generality of learned information risk factor is high.

According to Aircloak’s guidelines for assigning severity levels, the fact that two of the four factors have medium or high risk leads to a severity score of ‘moderate’.

Thanks to the UPenn team!

Once again, we would like to congratulate the members of the UPenn team, and thank them for helping make Diffix a stronger anonymization system. We look forward to more attacks from the UPenn team and from others.


Appendix: Details of the UPenn attack

Background:

The UPenn attack is based on the Linear Reconstruction attack first published by Dinur and Nissim in 2003. This is not the first reconstruction attack on Diffix: Cohen and Nissim received a bounty payout for a reconstruction attack in the first bounty program. To understand the innovation in the UPenn attack, it is useful to understand the Cohen/Nissim attack and the subsequent defense implemented in Diffix Cedar.

The reconstruction attack requires that the attacker be able to make queries that select different subsets of users from a given set of users. Selecting random subsets is a good approach. At the time of the Cohen/Nissim attack, the defense against the reconstruction attack (Diffix Birch) assumed that the users would be individually specified in an IN() clause:

uid IN (3, 17, 28, 54)

Diffix Birch would add a noise layer for each element of the IN() clause, and the amount of noise was enough to prevent an accurate reconstruction. Cohen/Nissim were able to bypass the noise by expressing the random selection in a single WHERE expression, for instance:

WHERE floor(5 * pow((client_id * 2), 0.5) + 0.5) = 
      floor(5 * pow((client_id * 2), 0.5)) AND
      client_id BETWEEN 1000 AND 2000


The first expression selects a random subset of users from the set of users that fall in the range of the second expression, and the second expression (BETWEEN) is needed to scale the attack to a manageable number of users. By changing the various constants in the first expression, different random subsets are chosen.
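
To see why this works, here is a minimal Python sketch (an illustration, not part of the attack code) of the predicate acting as a per-user coin flip. The constants 5 and 2 from the query above are exactly the values an attacker would vary to get a fresh subset:

import math

def in_subset(client_id, a, b):
    # floor(x + 0.5) == floor(x) holds exactly when the fractional part of x
    # is below 0.5, so for x = a * sqrt(client_id * b) the condition behaves
    # like a pseudo-random coin flip for each user.
    x = a * math.sqrt(client_id * b)
    return math.floor(x + 0.5) == math.floor(x)

# Users selected by the BETWEEN condition.
users = range(1000, 2001)

# Different constants give different, roughly independent subsets.
subset_1 = [u for u in users if in_subset(u, 5, 2)]
subset_2 = [u for u in users if in_subset(u, 7, 3)]

print(len(subset_1), len(subset_2))        # each roughly half of the users
print(len(set(subset_1) & set(subset_2)))  # overlap roughly a quarter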

The defense against this attack in Diffix Cedar was to eliminate the use of math expressions on columns where all or most values in the column each identify a single user. Based on measurements of the attack, we decided that if at least 80% of the values in a column are unique to a single individual, then math is not allowed on that column. Below 80%, the reconstruction was not very accurate.

The insight of the UPenn team was that, by using a different column to scale the attack and by selecting among few enough users, with high probability few if any of the values in the identifying column would apply to more than one user.

By way of example, a WHERE clause from a UPenn attack query is:

WHERE floor(pickup_latitude ^ 8.789 + 0.5) = 
      floor(pickup_latitude ^ 8.789)
      AND trip_distance IN (0.87, 1.97, 2.75)


The first expression selects the random subset. The math is a bit different from the Cohen/Nissim query, but the effect is the same. We may refer to this column (‘pickup_latitude’) as the mixing column. 

The second expression, however, uses a different column from that of the first expression (‘trip_distance’ instead of ‘pickup_latitude’). We may refer to this as the selecting column. As a result, a ‘pickup_latitude’ value that is common to several users in the whole table will likely be unique to a single user in the subset of rows selected by the ‘trip_distance’ expression.
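
The effect is easy to check with a toy simulation (synthetic data, not the actual taxi table; the rounding precision and the selection values are invented for the example):

import random
from collections import Counter

random.seed(0)

# Synthetic rides: latitudes coarse enough that values repeat across the table,
# plus a separate trip_distance attribute used only for selection.
num_rides = 100_000
latitudes = [round(random.uniform(40.0, 41.0), 4) for _ in range(num_rides)]
distances = [round(random.uniform(0.0, 30.0), 2) for _ in range(num_rides)]

# In the whole table, almost every latitude value occurs more than once.
whole = Counter(latitudes)
print(sum(1 for c in whole.values() if c == 1) / len(whole))  # tiny fraction unique

# In the small subset selected by a few trip_distance values,
# almost every latitude value is unique to a single ride.
subset = Counter(lat for lat, d in zip(latitudes, distances) if d in (0.87, 1.97, 2.75))
print(len(subset), sum(1 for c in subset.values() if c == 1) / len(subset))  # mostly unique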

The UPenn attack used for the effectiveness score determined the ‘vendor_id’ value of the driver. The vendor_id in the taxi database identifies the company that runs the taxi service. There are two vendors, split about evenly between the rides. The UPenn attack is more effective when the learned value is relatively more common. Since the goal of the attacker is to achieve the best effectiveness score, a natural choice for the UPenn team was to try to learn ‘vendor_id’. A complete query in this case is:

SELECT vendor_id, count(*)
FROM rides
WHERE floor(pickup_latitude ^ 8.789 + 0.5) = 
          floor(pickup_latitude ^ 8.789)
      AND trip_distance IN (0.87, 1.97, 2.75)


But what if the attackers want to learn something more interesting or more sensitive than ‘vendor_id’, for instance the ‘ssn’ (social security number) column? The UPenn team was able to do this by attacking one character at a time. ‘ssn’, for instance, has only 10 possible choices in each character position, and so each character symbol is relatively common.

The UPenn team could learn the first character of each driver’s ‘ssn’, for instance, with this query:

SELECT substring(ssn FROM 1 FOR 1), count(*)
FROM rides
WHERE floor(pickup_latitude ^ 8.789 + 0.5) = 
          floor(pickup_latitude ^ 8.789)
      AND trip_distance IN (0.87, 1.97, 2.75)


Attacking ‘ssn’ in this element-wise fashion resulted in about 90% correct guesses (versus 100% for ‘vendor_id’). The UPenn team was not able to run this attack on columns with more symbols (for instance ‘firstname’, which has 26 symbols per character position). The immediate problem was processing power: the attacks were run on a laptop that was not powerful enough to solve the equations. The UPenn team believes, however, that the attack on strings with more symbols is possible.

Prerequisites:

A claim in a GDA Score singling out attack has two types of columns: 1) the columns that are being learned, and 2) the columns that are a priori known (if any). Recall that the payout amount depends on how much prior knowledge is needed relative to what is learned. The attacker must therefore specify what is prior knowledge and what is learned so that the payout can be computed.

In the UPenn attack, the ‘pickup_latitude’ and ‘trip_distance’ values are known a priori, and the ‘vendor_id’ is what is being learned. In other words, the claim in essence says this: “I already happen to know as prior knowledge that there is a ride with pickup_latitude=-73.782234 and trip_distance=1.97. I claim that this ride also has vendor_id=’CMT’, and that only one driver has a ride like this.”

For the demonstrated attacks, the following prerequisites held:

  • All or nearly all values of the mixing column (‘pickup_latitude’) that are selected by the selecting column (‘trip_distance’) must be known.
  • Between roughly 50% and 80% of the values in the mixing column must be distinct to a single individual. Above 80%, the column is treated as identifying and therefore math can’t be used. As the percentage goes down, the quality of the attack degrades.
  • The mixing column must be numeric.
  • To get a 100% confidence improvement, the unknown column must have relatively few distinct values (roughly 10 or less). Confidence improvement decreases with an increase in the number of distinct values. (Note that this is a requirement of the bounty score itself. In practice the unknown column may be partitioned into a small number of ranges, and the range to which an individual belongs could be learned.)
  • Element-wise attacks required roughly 10 or fewer distinct symbols per character position to get a 90% improvement.

Attack:

For learned columns with few distinct values, the attack requires roughly 1000 queries of the following form:

SELECT vendor_id, count(*)
FROM rides
WHERE floor(pickup_latitude ^ 8.789 + 0.5) = 
          floor(pickup_latitude ^ 8.789)
      AND trip_distance IN (0.87, 1.97, 2.75)


The condition on ‘trip_distance’ must select a few tens of users. The exponent in the ‘pickup_latitude’ condition is modified for each of the roughly 1000 queries such that a relatively random set of rows is selected from among those users.

The answer from each of the queries contributes one equation to a set of equations that is then solved to determine the ‘vendor_id’ of each user with a distinct ‘pickup_latitude’ and ‘trip_distance’.
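
For readers who want to see the shape of the reconstruction step, the following is only an illustrative sketch on a small synthetic instance (not the UPenn team’s code or the real taxi data): the known ‘pickup_latitude’ values and the per-query exponents determine a 0/1 membership matrix, the noisy counts form the right-hand side, and a least-squares solve followed by rounding recovers a 0/1 indicator (for example “vendor is CMT”) for each user.

import math
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the attacker's situation.
num_users = 30                                   # "a few tens of users"
latitudes = rng.uniform(40.0, 41.0, num_users)   # known mixing-column values
secret = rng.integers(0, 2, num_users)           # unknown 0/1 attribute per user

def selected(lat, exponent):
    # Same bucketing trick as the WHERE clause: floor(x + 0.5) == floor(x).
    x = lat ** exponent
    return math.floor(x + 0.5) == math.floor(x)

num_queries = 1000
exponents = rng.uniform(5.0, 7.5, num_queries)   # one exponent per query

# Membership matrix: entry (q, u) is 1 if user u is selected by query q.
# The attacker can compute it because the latitudes are prior knowledge.
A = np.array([[1.0 if selected(lat, e) else 0.0 for lat in latitudes]
              for e in exponents])

# Observed answers: noisy counts of users with secret == 1 in each subset,
# mimicking the noise added to each count(*).
answers = A @ secret + rng.normal(0.0, 2.0, num_queries)

# Each noisy count is one linear equation; solve them all at once by least
# squares, then round to recover a 0/1 value per user.
estimate, *_ = np.linalg.lstsq(A, answers, rcond=None)
guesses = np.clip(np.round(estimate), 0, 1).astype(int)

print("correct guesses:", int((guesses == secret).sum()), "of", num_users)

In the actual attack the solver has to cope with Diffix’s specific noise and suppression behavior, so this is only the skeleton of the approach.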

Element-wise attacks require one such set of queries and equations per character in the attacked string.


