Fix for the MIT/Georgetown Univ Attack on Diffix



October 17, 2018 by Paul Francis

One of the two successful attacks from the first round of the Aircloak bounty program was from researchers Aloni Cohen and Kobbi Nissim, of MIT and Georgetown University respectively. Our fix for the attack was deployed in Aircloak’s release 18.3 on July 27th. Aloni and Kobbi had asked Aircloak and me not to reveal the details of the attack so that they could have the opportunity to get a publication from it, and we were happy to comply. On October 15th 2018 they presented the attack at the “Theory and Practice of Differential Privacy” workshop at CCS 2018 in Toronto, so congratulations on that! The talk was live-tweeted by @TedOnPrivacy (aka Damien Desfontaines), and so this is a good opportunity for me to talk about the attack fix as well as comment on the tweet-stream.

And we’re off!

Maybe I’m a little too paranoid, but I can just see the battle lines being drawn here: young and brilliant researcher exposing the hubris and incompetence of yet another evil corporation. Well ok, I see the ‘:D’ thingy, which I’m pretty sure means humor (though it might mean omg), but humor’s no good if there isn’t some truth behind it.

Which raises a good question: who am I, and on which side of the battle am I?

I’m Paul Francis. I’m a researcher at the Max Planck Institute for Software Systems (MPI-SWS) in Germany. I’m also a co-founder of the startup Aircloak. I work in a research partnership with Aircloak because it brings tremendous value to my research (more on this later). So I have a foot in both camps. I’m also the middle child of seven children, so I’m good at seeing both sides.

Sorry if the following seems like too much side-tracking. It’s important to understand the context in which Aloni’s attack takes place, and I don’t mean here just the context of the bounty program, but the broader context of how academia and industry view the data anonymization problem and why there is so little fruitful exchange between those two worlds.

All security mechanisms have something they need to enable and something they need to prevent. Symmetric key crypto enables a trusted party to read a message at very little computational cost. It prevents an untrusted party from reading the message without a prohibitive computational cost.

We all know what data anonymization is supposed to prevent, at least informally—it prevents an untrusted party from learning about individuals in the data. What does it enable? My answer: it allows an untrusted party to learn as much as possible about the data (while of course still preventing what it needs to prevent). Or, in short, anonymization should enable analytics by an untrusted party.

What does that even mean? There are thousands of analytic tasks. This page lists 16 R manuals of less than 100 pages, and 24 manuals of more than 100 pages. A comprehensive PostgreSQL manual is 3437 pages long. Now of course a single anonymization method doesn’t need to enable all of analytics, but it does seem to me that one should try to enable as much analytics as possible when designing an anonymization method.

Suppose you want to do data anonymization research and you are serious about the enabling thing. You have two big questions: What should I enable? How will I know if I’ve enabled enough? The way I decided, as a researcher, to answer those questions is to engage with the market and let it tell me. This is the reason I work closely with Aircloak. It allows me to engage the market in a meaningful way.

Very early on (six years ago) we decided to try using Differential Privacy (DP), but it didn’t take long to realize that there wasn’t enough analytics there to sustain an anonymization company. This led us to abandon formal or mathematical approaches. The market does not demand formal approaches, but it does demand evidence of strong anonymization, as do we, and this eventually led to the bounty program.

What’s the point of all this? I would characterize the academic community’s current approach to anonymization research as starting from the basis of provable anonymization (DP), and then trying to figure out what it can be used for. By contrast, we are committing ourselves to useful analytics while trying to ensure that it is strong enough. Perhaps not surprisingly, while Aircloak is paying people to attack a system which is demonstrably broadly useful, NIST is paying people to find uses for DP which is provably anonymous!

Much of academia sees Diffix as too informal and therefore too weak on privacy (and unfortunately my publication record in recent years attests to this). At the same time, there are some in industry that see Diffix as too strict and therefore not providing enough utility. For instance, a rule of thumb among some in industry is that any anonymization scheme with less than 1% re-identification risk can be regarded as anonymous enough. Aircloak finds itself fending off criticism from both sides!

I guess overall what I’m trying to say here is that DP and Diffix are approaching the same problem from different angles. We accept, even embrace the “arms race” that academia eschews. We try to accelerate it through transparency and the bounty program, and users of our technology manage risk by deciding who is allowed access to the Aircloak system.

Ok, back to the tweet-stream:

We did have protections, but we failed to anticipate all the ways in which the attack could be run. Note that Aloni didn’t attack the entire database, in part because it wasn’t necessary to do so to demonstrate the viability of the attack or get the maximum bounty payout. Just how much of a given database could be reconstructed with the attack is unknown. I say this just to correct the tweet, not to suggest that an attack is unimportant if it can’t reconstruct the whole database.


The basic idea of the attack is to count, for many different subsets of users (1000 or so), how many users in the subset have a certain attribute. For each count, the attacker knows the set of users that comprise it. Selecting random subsets of users will do the trick.
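To make the idea concrete, here is a minimal sketch of this general style of reconstruction attack. Everything in it is my own illustrative assumption, not Aloni’s actual code or parameters: 50 users, 1000 random subsets, a bounded amount of noise on each count, and a simple difference-of-means estimator rather than the linear-programming formulation.

```python
# Illustrative sketch of a subset-count reconstruction attack.
# All parameters (50 users, 1000 subsets, +-2 noise) are invented.
import random

random.seed(0)

N_USERS, N_QUERIES = 50, 1000
secret = [random.randint(0, 1) for _ in range(N_USERS)]  # hidden attribute bits

queries = []
for _ in range(N_QUERIES):
    subset = {u for u in range(N_USERS) if random.random() < 0.5}
    true_count = sum(secret[u] for u in subset)
    noisy_count = true_count + random.choice([-2, -1, 0, 1, 2])  # bounded noise
    queries.append((subset, noisy_count))

# For each user, compare the average count over subsets that include the user
# with the average over subsets that exclude the user; the gap estimates the bit.
recovered = []
for u in range(N_USERS):
    inside = [c for s, c in queries if u in s]
    outside = [c for s, c in queries if u not in s]
    gap = sum(inside) / len(inside) - sum(outside) / len(outside)
    recovered.append(1 if gap > 0.5 else 0)

accuracy = sum(r == s for r, s in zip(recovered, secret)) / N_USERS
print(f"recovered {accuracy:.0%} of the attribute bits")
```

The point of the sketch is only that many noisy counts over known subsets leak far more than any single count does; with enough subsets, bounded noise averages out.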

When I first looked into this attack, I presumed that in order to select the users, the attacker would need to enumerate them all. In SQL, this would require a condition statement like “WHERE user_id IN (1,3,6,7,9,11,….)”. This is what Aloni is referring to as the “naive reconstruction”. It fails because Diffix adds additional noise for each of the conditions (“user_id = 1”, “user_id = 3”, “user_id = 6”, and so on). In this case, Diffix does generate noise proportional to sqrt(number of people), as the tweet says, and this is too much noise for the attack to work. This is described in the arXiv paper on Diffix.

Note of course that such a large amount of noise would also hurt analytic quality, but normally an analyst does not try to enumerate a large number of users, so this doesn’t hurt normal usage of the system much. Note also that in general, however, Diffix adds noise proportional to sqrt(number of conditions) which is of course almost always far less than sqrt(number of people).

This tweet is partially true. Diffix does put a bunch of restrictions on SQL, but in order to prevent a variety of other attacks, not to prevent hashing per se. As a rule, we try to allow as much SQL as we can while still protecting against known or potential attacks. Here we got it wrong. Aloni’s attack could generate random enough sets of users with a condition like this: WHERE floor(5 * pow((user_id * 2),0.5) + 0.5) = floor(5 * pow((user_id * 2), 0.5)). Before the fix, Diffix would allow this but viewed it as a single condition and so added only a single layer of noise. By changing the various constants in the condition, Aloni’s attack could create different sets of random but known users.
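For the curious, the selection trick is easy to reproduce. The condition is true exactly when the fractional part of 5 * sqrt(2 * user_id) is below 0.5, which holds for a pseudo-random but fully predictable set of roughly half the users; changing a constant gives a different such set. The re-implementation below is mine, with user ids 1–100 invented for illustration:

```python
# Re-implementation of the quoted WHERE condition, to show how varying a
# constant selects different pseudo-random (but attacker-known) user subsets.
import math

def selected(user_id, k=5):
    x = k * math.pow(user_id * 2, 0.5)
    # floor(x + 0.5) == floor(x) holds exactly when frac(x) < 0.5
    return math.floor(x + 0.5) == math.floor(x)

users = range(1, 101)                               # invented user ids
subset_a = {u for u in users if selected(u, k=5)}   # the quoted constant
subset_b = {u for u in users if selected(u, k=7)}   # a different constant

print(len(subset_a), len(subset_b))   # each selects roughly half the users
print(subset_a == subset_b)           # the two subsets differ
```

Because the attacker can compute exactly which user ids satisfy each condition, each query yields a noisy count over a known subset, which is precisely the ingredient the reconstruction attack needs.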

Thanks Aloni for clarifying this. Again, not to diminish the importance of the attack, but the attack reconstructed whether or not users had “classic” credit cards, a condition which was true for 75% of the users, and it did so with perfect accuracy.

The Fix

Note that the “they” of “they don’t really know how” is, I assume, referring to Aloni and Kobbi, not to us.

So, yes we did fix it, but what exactly did we fix? Just this attack, or this class of attacks?

Our fix is slightly more general than “just this attack”, but certainly does not try to defend against the class of linear program reconstruction attacks. DP of course does, but I’m not sure how we could do it with Diffix and still retain utility.

What we do instead is further restrict SQL. Specifically, we disallow IN and all math, and allow only the following set of functions: lower, upper, substring, trim, ltrim, rtrim, btrim, extract_words, year, quarter, month, day, hour, minute, second, weekday, date_trunc, and bucket.

This is of course highly restrictive, so we apply these restrictions only on columns that are at risk of being able to isolate users. By this we mean columns where a substantial fraction of values uniquely identify a single user. Aloni’s attack for instance used the user_id column, which by definition has one value per user. Other columns may have this characteristic as well, for instance email address, SSN, etc.

In the fix, Diffix measures each column in the background to determine what fraction of values uniquely identify a single user. In the current setting, if 50% or more of the values do so, then we regard the column as “isolating”, and apply the above restrictions. Otherwise we do not. The measurement is done periodically, and all columns are regarded as isolating until the first measurement is done.
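A sketch of what such a measurement might look like, in Python over an invented four-row table (Diffix’s actual implementation, which runs as a background query against the database, surely differs):

```python
# Sketch of the "isolating column" measurement: a column is treated as
# isolating when at least half of its distinct values belong to exactly
# one user. The table rows and the 50% threshold handling are illustrative.
from collections import defaultdict

def is_isolating(rows, column, threshold=0.5):
    users_per_value = defaultdict(set)
    for row in rows:
        users_per_value[row[column]].add(row["user_id"])
    unique = sum(1 for users in users_per_value.values() if len(users) == 1)
    return unique / len(users_per_value) >= threshold

rows = [
    {"user_id": 1, "email": "a@x.org", "card": "classic"},
    {"user_id": 2, "email": "b@x.org", "card": "classic"},
    {"user_id": 3, "email": "c@x.org", "card": "gold"},
    {"user_id": 4, "email": "d@x.org", "card": "gold"},
]

print(is_isolating(rows, "email"))  # True: every email maps to one user
print(is_isolating(rows, "card"))   # False: both card values are shared
```

Columns like user_id and email come out isolating and get the restrictions; a low-cardinality column like card type does not, so normal analytics on it is unaffected.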

By allowing the functions described above, we enable a variety of aggregations for strings (substring, trim, etc.), datetimes (month, day, etc.), and numbers (bucket). We don’t believe that enough randomness can be obtained with these functions alone.

Obviously this is not a perfect fix. First and foremost, there may be other ways of selecting enough random sets of users with one or a very small number of conditions. If such ways are discovered, then I imagine we’ll apply specific restrictions to mitigate them. As I said before, the balancing act for us is to maximize utility while preventing known attacks. If the restrictions are too onerous, then we will have failed in our goal of achieving both high utility and strong privacy. Second, it may be that a column is not isolating overall, but is isolating when some set of other conditions are applied. I can’t think of cases where this may occur, so if it does at all it should be rare. Nevertheless, if such occurrences are found, we could, for instance, add mechanisms to detect them and apply restrictions as needed.

I understand that most academics will find this informal patching approach to be deeply unsatisfying—it is in many respects the same for us. My hope is that at some point other researchers will nevertheless see the value in our general approach of emphasizing utility, and join in and work towards making Diffix more disciplined.

Finally, I’d like to thank Kobbi and Aloni for their attack. They were a pleasure to work with, and Aloni even found a couple of bugs in our system in the process. I deeply appreciate their contribution to strengthening Diffix, and hope they’ll join in the next round of the bounty program.
