From time to time a participant in a Diffix bounty program runs a successful attack and receives a payment. One might take this to mean that individuals in the databases of organizations that use Aircloak data anonymization were at serious risk. This is not necessarily the case, and so far has never been the case. The Diffix bounty program is designed to incentivise attackers to discover any vulnerability, whether serious or trivial.
In this article, we discuss the Diffix bounty program: why we need it and how we designed it. We also discuss the severity ratings we assign to vulnerabilities: what they mean and, especially, what they don’t mean.
Why a data anonymization bounty program?
Every year hundreds of companies spend millions of dollars on bug bounties. The idea is simple: incentivise the good guys to find vulnerabilities, and fix them before the bad guys find them.
In spite of the recent growth in privacy technology companies, the idea of a data anonymization bounty hasn’t caught on. Diffix is the first, and still the only, data anonymization technology for which there is a bounty program. The first bounty, in 2018, was for Diffix Birch and was run by Aircloak. The second bounty, for Diffix Dogwood and run during the latter half of 2020, is sponsored by MPI-SWS under the auspices of the Open GDA Score Project.
Perhaps the main reason bounty programs for data anonymization aren’t common is that there simply aren’t any reported malicious attacks on data anonymization. This means that either non-malicious attacks (mostly by academics) are exposing vulnerabilities before bad guys have a chance to exploit them, or that somehow all malicious attacks go unreported. We suspect it is the former. Either way, companies making data anonymization solutions don’t have a strong motivation to invite attacks, and the associated negative publicity, through a bounty program.
With Diffix, however, Aircloak aims to democratize data anonymization, making it available to many organizations for many use cases. We believe that the best way to stay ahead of the bad guys is with transparency and a proactive approach to finding vulnerabilities. So far this has worked well for us. Diffix vulnerabilities have always been fixed well in advance of any malicious threat.
What does it mean to pay out a bounty?
The award structure for the Diffix bounty is designed so that a payout can be made even for attacks that are not practical and that are highly unlikely to lead to re-identification of an individual. What do we mean by this?
There are two necessary steps to re-identifying an individual in a database (assuming an attacker has access to the database and wishes to attack it). The first is simply to isolate an individual. In other words, the attacker must be able to say “these attributes (age, zip code, gender, whatever) describe a single individual, even though I don’t know who that individual is.” Data in which individuals are isolated, but in which the information needed to identify them is hidden, is referred to as pseudonymous by the GDPR.
The second step is to identify the individual. That is to say “the individual with these attributes is Paul Francis of Kaiserslautern, Germany”. When this is done, the individual has been fully re-identified.
The award structure for Diffix does not require that an individual be re-identified, only that they be isolated. In fact, even if only a single individual can be isolated, a payout is made. To put this in perspective, a data set can be HIPAA-compliant even where every individual can be isolated: a HIPAA-compliant data set would lead to the maximum payout in a Diffix bounty.
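To make the isolation condition concrete, here is a minimal sketch in Python. It assumes a toy table of made-up records; the column names, the values, and the helper `isolating_combinations` are all hypothetical and are not part of Diffix or the bounty tooling. A combination of attribute values isolates an individual when exactly one record matches it.

```python
from collections import Counter

# Hypothetical toy records; column names and values are illustrative only.
records = [
    {"age": 34, "zip": "67655", "gender": "F"},
    {"age": 34, "zip": "67655", "gender": "M"},
    {"age": 51, "zip": "67663", "gender": "M"},
    {"age": 51, "zip": "67663", "gender": "M"},
]

def isolating_combinations(rows, columns):
    """Return the attribute-value combinations that match exactly one row."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]

# Two records are unique on (age, zip, gender); the other two share all values.
isolated = isolating_combinations(records, ["age", "zip", "gender"])
print(isolated)

# Age alone isolates nobody in this toy table.
print(isolating_combinations(records, ["age"]))
```

Note that nothing in this check says who the isolated individuals are. That is the difference between isolation and full re-identification.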
The Diffix bounty payment in fact has two parts, each worth $5000. The two parts correspond to two questions:
- Does the attack work at all, even on a single individual?
- If so, how broadly applicable is the attack?
The first part gives a payout even if only a single individual can be isolated, as just described. We call this the effectiveness score. The size of the bounty grows as the percentage of correct claims goes up, and as the amount of prior knowledge required to make a claim goes down. Prior knowledge is information about the database that the attacker knows prior to making the attack. The bounty program allows the attacker to have as much or as little prior knowledge as the attacker wishes.
The other part gives a payout depending on how many individuals can be isolated, on how many attributes can be learned about the individuals, and on what prior knowledge is needed. We call this the coverage score.
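The two raw quantities behind these scores can be sketched as follows. This is only an illustration: the `Claim` type, the function names, and the sample numbers are assumptions, and the actual formulas that turn these quantities (together with prior knowledge) into a payout are not given in this article and are not reproduced here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Claim:
    """One attacker claim that a set of attributes isolates an individual."""
    individual_id: int  # hypothetical identifier used only to check the claim
    correct: bool       # whether the claim turned out to be true

def fraction_correct(claims):
    """Share of claims that were correct (relates to the effectiveness question)."""
    if not claims:
        return 0.0
    return sum(c.correct for c in claims) / len(claims)

def fraction_covered(claims, population_size):
    """Share of individuals correctly isolated (relates to the coverage question)."""
    covered = {c.individual_id for c in claims if c.correct}
    return len(covered) / population_size

# Made-up example: four claims, three correct, in a table of 100 individuals.
claims = [Claim(1, True), Claim(2, True), Claim(3, False), Claim(4, True)]
print(fraction_correct(claims))       # share of correct claims
print(fraction_covered(claims, 100))  # share of the population isolated
```

In this made-up example the attack is fairly reliable on the claims it makes but reaches only a small share of the individuals, which is why the two questions are scored separately.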
The Aircloak severity score
Before announcing a vulnerability to the public, participants in a Diffix bounty program agree to wait until Aircloak has engineered a fix for the vulnerability and deployed the fix to customers. Aircloak is given up to one year to do so.
Once a vulnerability has been reported to Aircloak and validated, Aircloak assigns a severity score to the vulnerability and informs its customers of the vulnerability.
The main purpose of the severity score is to inform Aircloak customers as to how quickly they need to do a risk assessment for their own environment. The score has five settings from ‘very low’ to ‘very high’. The interpretation is as follows:
- Very high: The customer should immediately assess the vulnerability. It may be necessary to severely limit access to the Aircloak system or even curtail operation. For Aircloak, eliminating the vulnerability, or developing tools to detect attacks using the vulnerability, takes immediate priority.
- High: The customer should quickly assess the vulnerability, but may reasonably take a few days to do so. It is unlikely that immediate additional access limitations would be necessary. For Aircloak, fixing the vulnerability takes high priority, and is likely to lead to an interim version release.
- Moderate: The customer may assess the vulnerability when it is convenient to do so. It is unlikely that any special action is required. For Aircloak, the vulnerability fix can probably wait until the next regularly-scheduled release cycle.
- Low: The customer can probably ignore the vulnerability. For Aircloak, the vulnerability fix takes low priority, and might not even be fixed in the next release cycle.
- Very low: The vulnerability is ineffective.
The severity score is necessary because the bounty payout amount is not a particularly good indicator of how severe a vulnerability might be. A substantial bounty payment may be made on an attack that is low risk in practice.
The primary risk factors that Aircloak takes into account are:
- Required prior knowledge: An attack may require specific knowledge, either external to the database or from the database itself. Factors include how much prior knowledge is needed as well as the generality of the prior knowledge (whether only certain types of data can be used).
- Detection exposure: Queries in the Aircloak system can be logged. If the attack leaves a clear signature, then this serves as a deterrent.
- Data conditions: Specific conditions in the data itself may be required for the attack to work.
- Generality of learned information: An attack may be able to learn certain types of data but not others.
It is also important to recognize that Aircloak’s severity score is conservative in that there are two important risk factors that we do not assess ourselves but rather assume to be the worst case.
First, Aircloak does not take into account the difficulty of re-identification in its severity score. We assume that re-identification is possible even though in practice it may well not be for most attackers or most individuals in the database.
Second, Aircloak does not take into account the likelihood that an analyst has knowledge of the attack. For customers that use access controls to determine which analysts can use the system, for instance, the chance that any given analyst knows of an attack that has only recently been discovered by Aircloak or a bounty participant is very low.
Ultimately, though, it is up to the customer to assess risk. Aircloak provides a detailed description of the vulnerability and a severity score to help the customer determine urgency, but only the customer, with detailed knowledge of its data and its analysts, can determine the actual risk. Indeed, a vulnerability with a severity score of ‘high’ or even ‘very high’ may well pose very little risk for certain customers.
Given all the factors to consider, assigning a severity score is hardly an exact science. Aircloak assigns scores according to the following general guidelines:
- Very high: All four risk factors are evaluated to be high risk. This score is reserved for cases where either no prior knowledge is needed or at most prior knowledge of a few facts about the victim is needed, where very few queries are needed and the queries do not appear very different from normal queries, where no special data conditions are required, and where almost anything can be learned.
- High: Three of the four risk factors score at high risk.
- Moderate: Two of the four risk factors score at high risk.
- Low: One of the four risk factors scores at high risk.
- Very low: None of the four risk factors scores at high risk.
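The guidelines above amount to counting how many of the four risk factors are judged to be high risk. A minimal sketch of that mapping in Python follows. The factor keys and the `severity` function are hypothetical names chosen for illustration, and the sketch deliberately ignores the human judgment that goes into deciding whether each factor is high risk.

```python
# The four primary risk factors, using hypothetical key names.
RISK_FACTORS = (
    "required_prior_knowledge",
    "detection_exposure",
    "data_conditions",
    "generality_of_learned_information",
)

# Number of factors judged high risk -> severity level, per the guidelines.
SEVERITY_BY_HIGH_COUNT = {
    4: "very high",
    3: "high",
    2: "moderate",
    1: "low",
    0: "very low",
}

def severity(high_risk):
    """Map per-factor judgments (factor name -> True if high risk) to a level."""
    missing = set(RISK_FACTORS) - set(high_risk)
    if missing:
        raise ValueError(f"missing risk factors: {sorted(missing)}")
    count = sum(bool(high_risk[factor]) for factor in RISK_FACTORS)
    return SEVERITY_BY_HIGH_COUNT[count]

# Made-up assessment: little prior knowledge needed and almost anything can be
# learned, but the queries leave a clear signature and need special data.
example = {
    "required_prior_knowledge": True,
    "detection_exposure": False,
    "data_conditions": False,
    "generality_of_learned_information": True,
}
print(severity(example))
```

Because the score depends only on the count, two quite different vulnerabilities can receive the same level. This is one reason the customer still has to read the detailed description.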
Goal: no known vulnerabilities, high utility
Aircloak is committed to providing high-utility analytics while maintaining strong anonymity. Practically speaking, this means forgoing formal anonymization models like differential privacy and k-anonymity, which undoubtedly offer strong anonymity but rarely acceptable utility.
The goal of good utility leads to a complex system that is hard to analyze formally. This has led to a structured but informal development process based on the following three principles:
- No known vulnerabilities. At all times, we strive to eliminate any known vulnerabilities. Of course there is always a lag between finding and fixing vulnerabilities, and some vulnerabilities just aren’t that serious and don’t need to be immediately fixed. Nevertheless, the basic stance is to eliminate all known vulnerabilities.
- Proactively find vulnerabilities. It is of course meaningless to eliminate known vulnerabilities if one makes no effort to find them. This is why Aircloak is transparent and has a bounty program in addition to its own internal security analyses.
- Utility first. Within the “no known vulnerabilities” framework, we take a utility-first stance. What this means for instance is that we allow new types of queries so long as we are not able to find vulnerabilities. We constantly push the envelope of utility. When vulnerabilities with a given query type are discovered, we either find a fix that allows the query type, or disallow the query type. The multiple versions of Diffix (Diffix Aspen, Diffix Birch, Diffix Cedar, Diffix Dogwood…) reflect this development process.
Indeed, our very definition of “strong anonymity” is that there are no publicly known vulnerabilities, and that privately known vulnerabilities are eliminated before they become publicly known.
An ongoing process
Aircloak is leading the way in the democratization of data anonymization. We are moving it from a specialized discipline requiring experts to design purpose-built anonymization solutions on a case-by-case basis, to a general-purpose software tool that can be installed and used without any particular anonymization expertise.
As part of this, we are also leading the way in how to evaluate and test data anonymization. We are constantly learning from our mistakes as we go. The bounty program and the “no known vulnerabilities” approach are by no means perfect, but so far they have worked well. They are improving all the time as we gain experience, and we hope that our approach becomes a model for other data anonymization professionals.