A frequent assumption among privacy professionals is that any data anonymization system that allows unlimited queries leads to serious privacy loss. This article debunks that assumption with a simple example derived from the seminal 2003 paper of Dinur and Nissim. We give examples of how Diffix achieves strong anonymity and good utility with unlimited queries (and no budget), and suggest that core Diffix mechanisms, like sticky layered noise, may well be the only way to achieve these properties.
In 2014, Aaron Roth and Cynthia Dwork introduced the Fundamental Law of Information Recovery in the book “The Algorithmic Foundations of Differential Privacy.”
As stated in that book, the Fundamental Law is this:
“Overly accurate answers to too many questions will destroy privacy in a spectacular way.”
The Fundamental Law plays an important role in NIST’s endorsement of the need for Differential Privacy in the US Census. The NIST document invokes the Fundamental Law several times, paraphrasing it as saying “that even in the presence of noise one can determine the contents of a private database by issuing and receiving the responses to too many queries.” Closer to home, this idea that unlimited queries leads to massive privacy loss has been used to dismiss Diffix as not being private.
Roth and Dwork derive the Fundamental Law from a seminal paper in 2003 by Dinur and Nissim, “Revealing information while preserving privacy.” This paper observes that there are two main approaches to preserving privacy in a database. One is to perturb the data or the answers. The other is to limit the type of questions that may be asked. Dinur/Nissim posit an anonymization system that perturbs the data, and then goes on to show how data perturbation is insufficient to protect anonymity unless either the amount of perturbation (added noise) is high enough, or the number of allowed queries is low enough.
In this article, we modify the Dinur/Nissim system to demonstrate that privacy can be protected even when unlimited queries are allowed. The modification limits the types of queries rather than the number of queries. The insight is not new — Differential Privacy researchers have studied a variety of similarly simple systems in the past — but here serves to show that an interpretation of the Fundamental Law where unlimited queries necessarily leads to unlimited privacy loss is false.
Our example system, while being simple, is not practical because it allows only one type of query. Our goal in developing Diffix is to design a practical anonymization system with the following important properties:
- Strong anonymity (in the GDPR sense though not a mathematical sense),
- Remarkably good utility (compared to both Differential Privacy and traditional mechanisms),
- Ease-of-use (no special anonymization expertise is required), and
- Generality (works for all basic data types and a wide range of use cases).
We believe that good utility cannot be achieved without a rich query language and unlimited queries. At the same time, this means that we give up on the mathematical guarantees characterized by Differential Privacy. This article goes on to describe how we manage the process of ensuring that Diffix has strong anonymity, primarily through transparency and bounty programs. Finally, the article gives examples of Diffix mechanisms that combine both perturbation and type limiting to achieve the above four properties.
Not only is Diffix the only anonymization system to achieve the above four properties, it may well be that some of the core mechanisms in Diffix are the only way to achieve these properties.
The Reconstruction Attack
The 2003 paper by Dinur and Nissim is seminal in the development of both Differential Privacy and the Roth/Dwork Fundamental Law of Information Recovery. The paper introduced the idea of a reconstruction attack, whereby an unknown dataset could be fully reconstructed even when noise is added.
The paper proposes an anonymization system and then demonstrates the reconstruction attack on it. The system is this:
- There is a simple database table. There is one column, and the values in the column are either 0 or 1. Each row in the table pertains to one distinct individual.
- The attacker can make queries about the table through a query engine. The query engine allows analysts to request the sum of column values for some specified set of rows. For instance, “What is the sum of values for rows 2, 7, 18, and 24?” (Other types of queries may well be allowed, but they are not needed for the attack so we can disregard them.)
- The query engine randomly perturbs the answer.
The attacker does not know the column values for each row and wishes to reconstruct each row in the table in the correct order.
The gist of the reconstruction attack is this. The attacker creates random sets of users and executes a query for each set. Each such query produces a pair of equations of the form:
R2 + R7 + R18 + R24 > C1 R2 + R7 + R18 + R24 < C2
where Ri is the value for row i (0 or 1), and C1 and C2 are the lower and upper bounds of the noisy count. All the queries taken together form a set of linear equations that can then be solved for the variables Ri, thus reconstructing the data.
Dinur and Nissim proved that the attack works so long as the amount of noise is small enough, and perhaps more importantly that the attack fails if the amount of noise is large enough. What’s more, even if the amount of noise is small, if the number of queries allowed was correspondingly small, then the attack still fails.
Furthermore, the result holds even if the attacker had prior knowledge of values in the tables, up to and including knowledge of all but one value in the table.
This was a very exciting result. It suggested that, no matter what the attacker knows, and no matter what query the attacker makes, so long as limits are placed on the number of queries (relative to a certain amount of noise), the attack fails. Within a couple of years, the result was formalized as Differential Privacy. A decade later, Roth and Dwork used it to derive the Fundamental Law.
What about limiting query types?
Imagine a new anonymization system:
- The table is the same as Dinur/Nissim’s.
- The query engine allows only one type of query, which is to count the number of rows that have a certain value.
- The query engine randomly perturbs the answer.
- Each distinct query may only be answered once. (Needed to prevent repeated identical queries from averaging away the noise.)
This system allows an unlimited number of different queries: number of rows with value 0, number of rows with value 1, number of rows with value 2, etc. The reconstruction attack does not work on this system because there is no way to select random sets of users. Indeed researchers have shown that simple systems like this can satisfy Differential Privacy.
Furthermore, the amount of noise can be quite low. Even on the assumption that the attacker knows the value of every row except one, and knows that values are constrained to be 0 or 1, all that is needed is enough noise to give the attacker substantial uncertainty as to whether that unknown user is a 1 or a 0. Selecting noise within a range of plus or minus 5 would certainly do it.
Critically for utility, however, is the fact that this system allows an analyst to know everything that is statistically interesting about the table. Assuming that the order of rows is not important, the only meaningful “information” one can derive from the table is how many of each value there are. In other words, in this specific case, limiting query types results in no loss of utility other than that from the added noise.
This is a crucial point. The Dinur/Nissim system allows a type of query that is not helpful in learning statistically interesting information. Because the order of rows is not important, the fact that the Dinur/Nissim system allows the analyst to specify which rows he or she is interested in does not improve statistical understanding of the data.
A utility-first development process
Although our simple system debunks the idea that unlimited queries necessarily leads to unlimited privacy loss, the system itself is of course not very useful. A useful anonymization system must allow many types of queries.
And of course, the reconstruction attack is not the only type of attack one can use to re-identify individuals in a database. In the development of Diffix, we consider many kinds of attacks: difference attacks, averaging attacks, noise exploitation attacks, and side-channel attacks as well as reconstruction attacks and others.
The goal of good utility leads to many types of queries and unlimited queries, which in turn leads to a complex system that is hard to analyze formally. This has led to a structured but informal development process based on the following three principles:
- No known vulnerabilities. At all times, we strive to eliminate any known vulnerabilities. Of course there is always a lag between finding and fixing vulnerabilities, and some vulnerabilities just aren’t that serious and don’t need to be immediately fixed. Nevertheless, the basic stance is to eliminate all known vulnerabilities.
- Proactively find vulnerabilities. It is of course meaningless to eliminate known vulnerabilities if one makes no effort to find them. Besides our own constant internal vulnerability analyses, Diffix is openly published, and we established the first and still only bounty program for data anonymization.
- Utility first. Within the “no known vulnerabilities” framework, we take a utility-first stance. What this means is that we allow new types of queries so long as we are not able to find vulnerabilities. We constantly push the envelope of utility. When vulnerabilities with a given query type are discovered, we either find a fix that allows the query type, or disallow the query type. The multiple versions of Diffix (Diffix Aspen, Diffix Birch, Diffix Cedar, Diffix Dogwood…) reflect this development process.
Indeed our very definition of “strong anonymity” is where there are no publicly known vulnerabilities, and where privately known vulnerabilities are eliminated before they become publicly known. So far this approach has worked well. Our transparency and bounty programs have led to the discovery of several vulnerabilities (see bottom of this page) by external experts. In two cases the vulnerabilities were eliminated before being made public. In one case the vulnerability discovered internally before being discovered externally, and was fixed before public disclosure. One vulnerability was publicly disclosed at the same time it was disclosed to us, but it turned out to be a false alarm: the attack turned out not to work in practice.
Of course past experience does not guarantee future experience, but in the process we have gained considerable confidence in Diffix. So far the strength of Diffix’ anonymity has far exceeded the need for the environments where it is deployed. At the same time, we continue to satisfy demanding and sophisticated use cases in finance, telecommunications, and health.
Examples of Diffix mechanisms
A complete description of Diffix (latest version Dogwood) is outside the scope of this article. In total Diffix is a collection of mechanisms packaged in a dynamic query framework (Diffix operates query-by-query). Many of the mechanisms have been in use for decades, like suppressing answers with too few users or reducing the effect of extreme values. Other mechanisms, however, are novel. Here we highlight a couple examples of how Diffix uses both perturbation and query type limiting.
Sticky Layered Noise
It has been known since the early 80’s that noise is a necessary component of a dynamic query anonymization system. (The section “A Brief History of Anonymization” in this paper gives a good explanation.) Preventing repeated queries from eliminating noise is critical in allowing an unlimited number of queries. A key innovation of Diffix is a mechanism for doing this that is robust against a variety of attacks to circumvent it.
The mechanism is called sticky layered noise. As its name implies, it has two key aspects. First, the noise is sticky in the sense that the same query generates the same noise. In fact we saw an example of sticky in our simple example system. The mechanism there required that each distinct query be answered only once. This is to prevent simple averaging of noise values to eliminate the noise. If the example system answered each query with a new pseudo-random noise value, then the noise could be averaged away. An alternative to answering each distinct query once is to generate the same noise for the same query.
The second key aspect is layering. It turns out that it is not enough to add noise per query: one must add noise per query condition. (See here for an example of why this is the case.) For instance, if the query is “count the number of individuals with age 46, who smoke and have had lung cancer”, then three separate sticky noise values are needed, one for age, one for smoking, and one for lung cancer. If the same query condition is used in another query, then the same noise value is used in that condition. For instance, if another query is “count the number of women who smoke and have had lung cancer”, then two of the noise layers will be the same as in the first query.
Sticky layered noise is a key innovation in Diffix, but almost certainly a necessary component of any anonymization system that has the four properties of strong anonymity, good utility, ease-of-use, and many use cases.
Number Alignment in Ranges
Aircloak’s implementation of Diffix exposes an SQL interface. Aircloak does not allow the full SQL syntax. (See here for a description of these limitations.) For example, one of the limitations is that, when an analyst queries for a range of values (“weight between 150 and 170”), that range must fall within a predetermined alignment. For example, the analyst cannot request “weight between 150 and 170.0001”. Doing so would allow the analyst to generate multiple queries that strictly speaking differ, but ultimately don’t result in a different answer (“weight between 150 and 170.0002”, “weight between 150 and 170.0003”, etc.). Without number alignment, each such condition would create a different noise value and allow the analyst to average out the noise.
An example of utility-first
The reconstruction attack serves as a good example of the utility-first development process. Diffix Birch had a defense against the reconstruction attack. That defense assumed that the attacker would need to explicitly enumerate each member of the random set using the SQL IN() clause. Our defense was to require a layer of noise for each member, resulting in too much noise for the attack to work.
Cohen and Nissim discovered that they could circumvent this noise using a single complex math expression in a WHERE clause that effectively selected random individuals. If the attacker knows which values are used to identify the individuals, then the attacker knows which individuals are selected by the math expression. In the special case of densely packed identifier values, Cohen and Nissim could determine these values without prior knowledge of the values.
In deciding how to defend against attacks in general, we need to make a utility versus privacy decision. That decision is of course impacted by the severity of the attack. The Cohen/Nissim attack was given ‘moderate’ severity (the middle point on a 5-point scale from ‘very low’ to ‘very high’), and as well was deemed not a threat to our installed customer base at the time (the conditions for the attack did not exist).
For this attack, we clearly needed to place limitations on the math. We chose to do that only for columns that identify individuals. An example of such a column might be account number. This allowed other columns to benefit from rich math expressions while still protecting against the Cohen/Nissim attack. We also used an automated measurement process to determine which columns identify individuals. Any column where more than 80% of the values were unique to a single individual was labeled as a column that can identify individuals. The value of 80% was chosen because the Cohen/Nissim attack was relatively ineffective at that threshold. The fix was deployed in Diffix Cedar.
Without getting into detail (an announcement is forthcoming), another team discovered that a reconstruction attack was possible by exploiting columns that have many identifying values, but still fall below the 80% threshold. This attack was given a ‘low’ severity. The fix is to apply certain math limitations across the board. This turns out not to have much impact on utility anyway because none of our customers needed the math.
The story isn’t over. There are still other reconstruction attacks and fixes being explored. This process can have two outcomes. Either the utility lost from the fixes is acceptable, or it is unacceptable. I’m fairly confident that it will be the former, but nevertheless we need to see how it plays out.
In recent years there has been an explosion of new thinking and creativity in data anonymization, both for descriptive analytics like SQL and advanced analytics like machine learning. There are robust research and development programs in both formal and informal approaches to data anonymity. This is indeed an exciting time in the world of data anonymization.
Author’s note: This article, published on Aug. 17, 2020, is a revision of the article of the same title and URL published on Aug. 11, 2020. The revision is due to the insightful and helpful tweets of Frank McSherry, Aloni Cohen, Thomas Steinke, and Damien Desfontaines. In particular, the original article tried to cast Diffix as “type-limiting”, and Differential Privacy as “number-limiting” (and not type-limiting). These characterizations are both mis-leading and unhelpful.