Ladies and gentlemen, let’s pack our things. Unsubscribe. Turns out anonymization is impossible. “Your Data Were ‘Anonymized’? These Scientists Can Still Identify You”, reads a New York Times headline. TechCrunch writes, dramatically: “Researchers spotlight the lie of ‘anonymous’ data”, and Mashable apologizes on the spot: “Sorry, your ‘anonymized’ data probably isn’t anonymous”. The Guardian, Forbes, CNBC, and many others ring the same alarm bells.
So, was it all for nothing? Has the era of privacy finally ended, now that ever more data and computational power are available to break every protection? Read to the end to find out! But here’s a spoiler: The answer is no.
In fact, anonymization isn’t in trouble at all.
Bad anonymization is. As it should be.
Let’s take a quick look at the original paper by Luc Rocher, Julien M. Hendrickx and Yves-Alexandre de Montjoye, called “Estimating the success of re-identifications in incomplete datasets using generative models” (note how they write “incomplete”, not “anonymized”).
- They look at datasets that were manipulated “through de-identification and sampling before sharing”, though the paper addresses only pseudonymization and sampling and not other forms of de-identification such as aggregation or cell swapping.
- They find that sampling rates previously considered sufficient may not be enough to anonymize a complex dataset, and that those datasets “are unlikely to satisfy the modern standards for anonymization set forth by GDPR”.
Let me repeat that: what they are saying is that pseudonymization and data sampling alone won’t satisfy modern standards for anonymization, and so do not count as anonymization under European regulations. To use the terminology from one of my favourite recent tweets: if this is how you’re securing your data, you aren’t doing anonymization; you are doing Anonymization ;)™️.
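To make the sampling argument concrete, here is a minimal sketch in Python. It is not the generative model from the paper; it is a plain counting exercise on a synthetic population, and every attribute, size and rate in it is an assumption made up for illustration. It asks one question: if a record is unique in a released sample, how often is it also unique in the full population?

```python
import random
from collections import Counter


def make_population(size, seed=0):
    """Build a synthetic population; every attribute here is made up.

    Direct identifiers (names, IDs) are assumed to be dropped already,
    which stands in for pseudonymization.
    """
    rng = random.Random(seed)
    return [
        (
            rng.randrange(100),         # coarse region code
            rng.randrange(1930, 2005),  # birth year
            rng.randrange(2),           # sex
            rng.randrange(5),           # marital status
            rng.randrange(4),           # household size bucket
        )
        for _ in range(size)
    ]


def share_of_sample_uniques_unique_in_population(
    population, sample_rate, num_attrs, seed=1
):
    """Among records unique in the sample, return the share that is
    also unique in the whole population, using the first num_attrs
    attributes as quasi-identifiers."""
    rng = random.Random(seed)
    projected = [record[:num_attrs] for record in population]
    population_counts = Counter(projected)
    # Release each record independently with probability sample_rate.
    sample_counts = Counter(r for r in projected if rng.random() < sample_rate)
    sample_uniques = [r for r, n in sample_counts.items() if n == 1]
    if not sample_uniques:
        return 0.0
    hits = sum(1 for r in sample_uniques if population_counts[r] == 1)
    return hits / len(sample_uniques)


if __name__ == "__main__":
    population = make_population(20_000)
    for num_attrs in (2, 5):
        share = share_of_sample_uniques_unique_in_population(
            population, sample_rate=0.1, num_attrs=num_attrs
        )
        print(f"{num_attrs} attributes: {share:.0%} of sample-unique "
              f"records are unique in the population")
```

With only a couple of attributes, a record that looks unique in the sample usually has look-alikes in the population that simply were not sampled. As more attributes are released, that cover disappears, which is the intuition behind the finding that sampling alone does not protect a complex dataset.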
This is not to downplay the importance of their paper. It does a good job of showing that sampling may be weaker than previously thought, and it is important to keep the public discussion around this topic alive. But such de-identification methods have been broken many times before, and the authors themselves provide a long list of failed Anonymization ;)™️ attempts. I have written before about the myths of anonymization, one of which is that anonymization isn’t possible at all; read that post if you’re interested in the finer details.
Contrary to what various news outlets suggest, the authors are not saying anonymization is doomed. They are challenging what they call the “de-identification release-and-forget model”, by which they specifically mean the “pseudonymize and sub-sample release-and-forget model”. They don’t consider other de-identification mechanisms like aggregation, cell swapping, top- and bottom-coding, rounding, and adding noise.
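For readers who have not met these mechanisms before, here is a rough sketch of three of them in Python: top- and bottom-coding, rounding, and adding noise. The function names, thresholds and noise scale are hypothetical choices for illustration only. None of these transformations guarantees anonymity on its own; the sketch just shows what each one does to a single value.

```python
import random


def top_bottom_code(value, low, high):
    """Clamp extreme values so that outliers no longer stand out."""
    return max(low, min(high, value))


def round_to(value, base):
    """Round a value to the nearest multiple of base."""
    return base * round(value / base)


def add_laplace_noise(value, scale, rng):
    """Add zero-mean Laplace noise with the given scale.

    The difference of two independent exponential variables with
    rate 1/scale follows a Laplace distribution with that scale.
    """
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return value + noise


if __name__ == "__main__":
    rng = random.Random(42)
    age = 97
    coded = top_bottom_code(age, low=18, high=90)  # hypothetical limits
    rounded = round_to(coded, base=5)              # hypothetical bucket size
    noisy = add_laplace_noise(rounded, scale=2.0, rng=rng)
    print(age, coded, rounded, noisy)
```

Each step trades detail for protection in a different way: coding hides the extremes, rounding merges neighbours, and noise makes any single reported value uncertain.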
Still, it is good to see yet another example of how easy it can be to get anonymization wrong. True anonymization is hard, and too much humbug is sold under that name. Political initiatives, standardization and certifications are urgently needed to weed out bad actors and bring a high level of privacy protection to everyone.
To paraphrase Thomas Edison: It’s not that anonymization has failed – it’s just that we’ve found a hundred ways anonymization does not work.
But unlike Edison, we already know what good anonymization looks like, and we provide it to our customers every day.
And while the era of Anonymization ;)™️ might finally come to an end, the era of privacy technology is only just starting.