Towards Usable Data Anonymity

When data is anonymized, part of its information is removed. This sometimes poses great challenges to analysts – how can the privacy of individuals be protected when sensitive data sets have to be analyzed?

by Paul Francis, Scientific Director, Max Planck Institute for Software Systems

The data anonymity problem — how to share data while protecting the individuals that comprise the data — has been intensely researched by computer scientists and statisticians for over 40 years. In spite of this, there exists no simple and general purpose way to anonymize data without severely compromising the usefulness of the data.

As a result, most shared data today is poorly anonymized at best.

In this article, we explore why this problem is so hard and why current research efforts are making so little practical progress. In particular, we argue that current theoretical approaches are sacrificing the good for the perfect, that more empirical research is needed, and suggest a way forward.

 

Introduction

The data anonymization problem is intuitively simple: to get a good aggregate or statistical measure or model of data while preventing the ability to obtain information about the individuals that comprise the data.

When I tell friends and colleagues that this problem is hard and in the general case unsolved, a typical response is “But that sounds easy, just remove stuff like names and addresses.” They are in good company. In the mid-1990s the state of Massachusetts released a medical dataset for research purposes that they “anonymized” by removing this kind of personally identifying information.

Recognizing, however, that the combination of zip code, birthdate, and gender usually uniquely identifies an individual, Sweeney was able to identify the medical records of the Governor of Massachusetts in the dataset [21]. She was able to obtain this information about the Governor (and many others) from publicly available voter registration records.

One might reasonably suppose, then, that to anonymize data one should generalize the publicly available attributes to the point that no individual can be uniquely identified from them. For instance, HIPAA (the rules governing the release of medical data in the USA) suggests that only the first three digits of a zip code be revealed and that, where possible, birth year be used instead of birthdate [2].

The amount of publicly available information about people, however, is constantly increasing. For instance, in 2006 Netflix released an “anonymized” dataset consisting of movie ratings for its movie recommendation competition, the Netflix Prize. The dataset consisted of only <userID, movieID, date, rating> tuples. In 2008, Narayanan and Shmatikov were able to identify individuals in the dataset by linking the records to publicly available IMDb ratings [19].

The Netflix De-Anonymization Incident

The de-anonymization of the Netflix dataset is one of the most often quoted incidents in the data anonymization community. In 2006, Netflix published 10 million movie rankings by 500,000 customers, as part of a challenge for people to come up with a better recommendation system than the one the company was using.

The data was anonymized by removing personal details and replacing names with random numbers, to protect the privacy of the recommenders. Arvind Narayanan and Vitaly Shmatikov, researchers at the University of Texas at Austin, de-anonymized some of the Netflix data by comparing rankings and timestamps with public information in the Internet Movie Database (IMDb).

Although this incident happened over 10 years ago, it vividly shows what happens when a dataset is released that is only pseudonymized and not properly anonymized. In the post-GDPR era this could have easily led to a multi-million-dollar fine.

 

Many people are surprised both at how little non-personal information it takes to uniquely identify individuals, and at how much publicly available information there is that can be used to tie that non-personal information back to us. This misunderstanding is so prevalent that the European Union General Data Protection Regulation (GDPR) makes an important distinction between anonymized data, and data that is merely pseudonymized by the simple removal of personally identifying information.

GDPR says that personal data must be protected in a variety of ways: only collect and store what data is needed, get informed consent, give users the ability to have their data deleted, and so on. GDPR considers pseudonymized data to be personal data even if it appears that there is no way to identify individuals from the pseudonymized data. By contrast, anonymized data is not regarded as personal and does not have to be protected [22].

For data to be considered anonymous, GDPR requires that there be no way to identify individuals in it. GDPR, however, doesn’t suggest how to anonymize data, and in fact GDPR doesn’t even suggest how to know when data is or is not anonymized. This is not GDPR’s fault: the fact is that there exists no anonymization mechanism that satisfies most analytic needs — easy to configure and use, very little data distortion, rich query semantics and types of analysis, no limits on the number of queries, and so on. This is in spite of the fact that the anonymization problem has been an active area of research for over 40 years.

In this article, I explore why data anonymization is so hard, why computer scientists have made so little practical progress in recent years, and how we are approaching the problem differently, leading to substantial improvements in usability while retaining strong anonymity. In particular, we describe the key concepts behind our own data anonymization system Diffix.

About the Author

Paul Francis

Paul Francis is a tenured faculty member at the Max Planck Institute for Software Systems in Germany and Scientific Advisor for Aircloak. Paul has held research positions at Cornell University, ACIRI, NTT Software Labs, Bellcore, MITRE, and was Chief Scientist at two Silicon Valley startups. In the past, Paul’s research centered around routing and addressing problems in the Internet and P2P networks. In recent years, his research has focused on online privacy with a focus on anonymized analytics.

francis@mpi-sws.org
https://francis.mpi-sws.org/
https://www.mpi-sws.org/



A Brief History of Anonymization

As early as the mid-1800s, confidentiality of individuals in the U.S. census became a concern [17]. The Census Bureau, for instance, started removing personally identifying information from publicly available census data.

Over the ensuing decades, the bureau increasingly used a variety of techniques to mitigate the possibility that released data would allow individuals in the data to be identified. These techniques include rounding, adding random noise to values, aggregation, cell suppression, cell swapping, and sampling among others [17].

In the 1950s, the bureau started using computers to tabulate data, and by the 1960s anonymization techniques like those described above were being automated [17]. Computers introduced the ability for analysts to “cross-tabulate” data (set filter conditions on queries). This tremendously increased an analyst’s ability to analyze the data, but also opened the possibility that an analyst could learn about an individual by specifying a set of query conditions that uniquely identify that individual.

A Census Bureau technician monitors the scanning of 1960 census questionnaires by FOSDIC. (Source: https://en.wikipedia.org/wiki/FOSDIC)

For instance, suppose that an analyst happens to know the birth day, zip code, and gender of someone (the victim). The analyst could query the data for the number of people with that birthday, zip code, and gender. If the answer is 1, then the analyst knows that the victim is indeed the only individual in the dataset with those attributes. We refer to a set of attributes that are unique to an individual as a pseudo-identifier. As a shorthand, let us refer to a pseudo-identifier as PsID. So here the PsID is birthday AND zip code AND gender.

Given the PsID, the analyst can then learn anything else about the individual. For instance, the analyst could ask for the number of people with that PsID and a salary between $100K and $110K. If the answer is 1, the analyst knows roughly the victim’s salary. The obvious but as it turns out naive fix is to suppress answers that pertain to only a few individuals, or that pertain to all but a few individuals (so that an analyst can’t ask about everyone except the pseudo-identified individual).

A series of papers published in the late 1970s, culminating in the Denning et al. paper on so-called general trackers [7], showed that this defense fails spectacularly. One can for all practical purposes reconstruct the complete dataset from a system deploying this defense.

By way of example, suppose that the analyst wants to learn the salary of a colleague. The attack requires that the analyst knows the total number of individuals in the dataset N, but this can be learned by taking the sum of two queries with mutually exclusive answer sets, such as the number of males and the number of non-males. The first step is to see if there is a pseudo-identifier for the colleague. Assuming N is known, this can be done with two queries producing answers N1 and N2 as:

(1) N1: The number of individuals that match PsID OR are male
(2) N2: The number of individuals that match PsID OR are NOT male

The analyst then computes N1+N2-N, and if the answer is 1, then PsID is a pseudo-identifier. Once this is confirmed, the analyst can learn the victim’s salary as follows. First, determine the sum of all salaries S by querying separately for the sum of salaries of all males and the sum of salaries of all non-males, and adding the answers. Then make the following two queries:

(1) S1: The sum of salaries for individuals that match PsID OR are male
(2) S2: The sum of salaries for individuals that match PsID OR are NOT male

The victim’s salary is then S1+S2-S. Note that the attack cannot be prevented simply by eliminating negation and unions from the query language. Negation can for instance be emulated by enumerating all values other than the one being negated (in this case female instead of NOT male), and unions can be emulated with intersections (A OR B is the same as A plus B minus A AND B).
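
To make the arithmetic concrete, here is a minimal sketch in Python of the general tracker attack against a toy query interface that applies the naive suppression rule described above. The dataset, column names, and suppression threshold are all made up for illustration; the point is only that the OR-queries are large enough to escape suppression, yet their combination isolates the victim.

```python
# Toy dataset: one row per individual. All values are made up.
rows = [
    {"gender": "M", "bday": "1970-01-01", "zip": "12345", "salary": 62000},
    {"gender": "F", "bday": "1980-02-02", "zip": "12345", "salary": 71000},
    {"gender": "M", "bday": "1957-12-14", "zip": "54321", "salary": 105000},  # the victim
    {"gender": "F", "bday": "1990-03-03", "zip": "54321", "salary": 58000},
    {"gender": "M", "bday": "1985-04-04", "zip": "99999", "salary": 66000},
    {"gender": "F", "bday": "1975-05-05", "zip": "99999", "salary": 90000},
]

SUPPRESS = 2  # refuse answers about fewer than 2, or all but fewer than 2, individuals

def query(aggregate, predicate):
    """Answer an aggregate over matching rows, applying the naive suppression rule."""
    matched = [r for r in rows if predicate(r)]
    if len(matched) < SUPPRESS or len(rows) - len(matched) < SUPPRESS:
        raise RuntimeError("suppressed")
    return aggregate(matched)

count = len
sum_salary = lambda rs: sum(r["salary"] for r in rs)

# The pseudo-identifier (PsID) the attacker happens to know about the victim.
psid = lambda r: r["bday"] == "1957-12-14" and r["zip"] == "54321"

# N: total number of individuals, from two queries with mutually exclusive answer sets.
N = query(count, lambda r: r["gender"] == "M") + query(count, lambda r: r["gender"] != "M")

# Confirm that the PsID is unique: N1 + N2 - N equals 1.
N1 = query(count, lambda r: psid(r) or r["gender"] == "M")
N2 = query(count, lambda r: psid(r) or r["gender"] != "M")
assert N1 + N2 - N == 1

# Recover the victim's salary with the same trick applied to sums.
S = query(sum_salary, lambda r: r["gender"] == "M") + query(sum_salary, lambda r: r["gender"] != "M")
S1 = query(sum_salary, lambda r: psid(r) or r["gender"] == "M")
S2 = query(sum_salary, lambda r: psid(r) or r["gender"] != "M")
print("victim's salary:", S1 + S2 - S)  # 105000, despite the suppression rule
```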

The next obvious and maybe less naive idea is to add noise to numeric answers so that the math doesn’t give an accurate answer. For instance, on queries that count the number of individuals, the anonymizing system could add a random integer uniformly chosen between say -2 and +2. Queries that for instance take a sum of some value like salary could add noise proportional in magnitude to the largest value.

The earliest publication we could find that proposes adding noise is from the statistician Fellegi in 1972 [13]. This was in response to an attack that Fellegi discovered that was similar to though less general than Denning’s attack described above.

The illustration shows the most influential scientific papers of the last decades that led to the development of Diffix.

Of course noise must be unpredictable and therefore random. In the context of a query-based system (versus a data release system as for instance with census data) this creates a difficulty. If each answer produced has a new zero-mean random noise sample, then an analyst need only repeat the query multiple times and take the average to effectively eliminate the noise.

Denning et al. recognized this problem, and in 1980 proposed seeding the random number generator with values taken from the query itself [5]. The idea here is that the same query would then produce the same noise, and averaging wouldn’t work. We refer to this general concept of “same-query-same-noise” as sticky noise, and indeed it is a key component of our own system Diffix.
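
The contrast between fresh noise and sticky noise is easy to demonstrate. The sketch below is a toy model, not Denning’s or Diffix’s actual mechanism: with a fresh zero-mean sample per answer, repeating the query and averaging recovers the true count, whereas seeding the noise from a hash of the query text and a secret salt makes every repetition return exactly the same answer.

```python
import hashlib
import random

TRUE_COUNT = 1000
SALT = "a-large-secret-random-value"   # hypothetical secret salt

def fresh_noise_answer():
    # A new zero-mean noise sample on every call: averaging defeats it.
    return TRUE_COUNT + random.gauss(0, 2)

def sticky_noise_answer(query_text):
    # Seed the generator from the query itself plus a secret salt:
    # the same query always gets the same noise ("same-query-same-noise").
    seed = hashlib.sha256((query_text + SALT).encode()).hexdigest()
    rng = random.Random(seed)
    return TRUE_COUNT + rng.gauss(0, 2)

query = "SELECT count(*) FROM patients WHERE gender = 'M'"

avg_fresh = sum(fresh_noise_answer() for _ in range(10000)) / 10000
print(round(avg_fresh, 2))              # converges to 1000: the noise is averaged away

answers = {sticky_noise_answer(query) for _ in range(10000)}
print(answers)                          # a single value: repetition reveals nothing new
```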

A problem with Denning’s sticky noise solution, which indeed the authors recognized, is that the analyst can still average out the noise by generating multiple different queries that all produce the same result. This could be done, for instance, by adding filter conditions that don’t affect the answer, such as requiring that age be less than 1000.

Indeed it is not even necessary to produce queries with the same result. For instance, consider two queries requesting the count of individuals, one with the conditions [gender=‘M’ AND age=30], and one with the conditions [gender=‘M’ AND age!=30]. The sum of the answers is equivalent to a single query with the condition [gender=‘M’]. Another pair of queries, this time with [gender=‘M’ AND age=31] and [gender=‘M’ AND age!=31] is also equivalent to the single query with [gender=‘M’]. Each query, however, is different and therefore results in different random noise. The noise can be averaged away with enough such pairs.

Indeed it is this kind of attack that led Denning in 1981 to say

“It seems unlikely that we will ever be able to efficiently decide at the time of a query whether a requested statistic could lead to the disclosure of a sensitive statistic.” [6]

Denning never further explored this problem, and essentially stopped working on data anonymity altogether.

In fact, by the mid 1980’s the entire computer science (CS) community had pretty much lost interest in data anonymity.

I presume that the reasons are in part due to the difficulty of the problem, but primarily because the research community, especially security and databases, had far more pressing problems to work on.

By contrast the statistics research community continued to work on data anonymity, mainly in the context of data release (e.g. census data). A variety of mechanisms were developed that focused less on anonymity per se, and more on understanding and controlling the resulting statistical properties.

These include cell swapping (literally swapping values between individuals), top- and bottom-coding (making extreme values less extreme), and generalization (bucketing or categorizing values) among other things [18]. These mechanisms from what I’ll call the “statistics toolkit” tend to require substantial effort and expertise: the structure of the data and the desired analytic outcome need to be carefully considered, and even then the strength of the resulting anonymity is usually not well understood.

Fast forward around 15 years, and data anonymization became a hot topic in computer science again. This may in large part be due to the stunning demonstration of re-identification by Sweeney mentioned earlier [21]. In the same paper, Sweeney presented a formal model for K-anonymity. The axiom is that if every individual is indistinguishable from at least K-1 other individuals, then anonymity is achieved. Higher K means more anonymous. A dataset satisfying K-anonymity can be generated by generalizing attributes, for instance using the first 3 digits of the zip code instead of all 5, until the model is satisfied for that K.

The figure illustrates K-anonymity. On the left is a snippet of an original database table A with three columns: birthdate, zip code, and salary. Assume there are potentially thousands of additional rows not shown. On the right is a snippet of the K-anonymized table produced from table A. To achieve K-anonymity, birthdate is generalized to the month and year, zip code is generalized to the first 2 digits, and salary is generalized to buckets of $10K. For each combination of the resulting three values, there are at least K=4 users.
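
A minimal sketch of the generalization step, using made-up rows and the same generalizations as in the figure (birthdate to month and year, zip code to its first two digits, salary to $10K buckets), together with a check of whether the generalized table satisfies K-anonymity for a given K:

```python
from collections import Counter

# Toy rows of (birthdate, zip, salary); values are made up.
rows = [
    ("1975-03-14", "53711", 64300),
    ("1975-03-02", "53790", 61800),
    ("1975-03-28", "53200", 68000),
    ("1975-03-09", "53555", 62500),
    ("1981-11-05", "10001", 91200),
    ("1981-11-17", "10099", 97800),
    ("1981-11-23", "10333", 95100),
    ("1981-11-30", "10777", 99000),
]

def generalize(bday, zipcode, salary):
    # The same generalizations as in the figure: month/year, 2-digit zip, $10K buckets.
    return (bday[:7], zipcode[:2] + "***", (salary // 10000) * 10000)

def is_k_anonymous(rows, k):
    counts = Counter(generalize(*r) for r in rows)
    return all(c >= k for c in counts.values())

print(is_k_anonymous(rows, 4))  # True: every generalized combination has at least 4 rows
```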

In 2003, Dinur and Nissim published a paper that arguably has had a more profound impact on how CS researchers think about data anonymity than any other work [9]. To simplify and paraphrase, Dinur-Nissim showed that even without ever repeating the same query, adding enough noise to defeat the general tracker as described above may still not prevent reconstruction of the original data. They showed that, in a setting where an analyst can select which users contribute to a noisy sum of attribute values, the amount of noise must exceed some threshold, and/or the number of queries made must be limited.

The basic idea of the attack is to generate queries where each query selects a different set of users, and where each user is represented roughly half the time. Selecting users randomly achieves this. Each query can be formulated as an inequality, where each user’s value is an unknown variable contributing to the noisy sum. As long as the amount of noise for each sum is small enough and the number of inequalities is large enough, the resulting set of inequalities can be solved.
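
The reconstruction idea can be shown in a few lines. The sketch below is a simplification that uses least squares rather than the linear-programming formulation of the paper, and assumes the attacker can request noisy sums over randomly selected subsets of users; with small noise and many more queries than users, rounding the least-squares solution recovers the hidden values.

```python
import numpy as np

rng = np.random.default_rng(0)

n_users = 100
secret = rng.integers(0, 2, size=n_users)          # hidden 0/1 attribute per user

n_queries = 1000                                    # many more queries than users
# Each query selects a random subset of users (each user included with prob. 1/2).
selections = rng.integers(0, 2, size=(n_queries, n_users))
noise = rng.normal(0, 2, size=n_queries)            # small noise added to each sum
noisy_sums = selections @ secret + noise

# Solve the over-determined system and round each estimate to the nearest 0/1 value.
estimate, *_ = np.linalg.lstsq(selections, noisy_sums, rcond=None)
reconstructed = np.clip(np.round(estimate), 0, 1)

print("fraction of users reconstructed correctly:",
      np.mean(reconstructed == secret))             # typically close to 1.0
```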

Dinur-Nissim didn’t merely add a new attack to the catalog; it suggested a new general way of thinking about anonymity. Namely, if there is enough noise, or few enough queries, then independent of what the queries are or what the attacker knows, anonymity can be achieved. Furthermore, early models in this direction suggested that the approach might be practical [4].

This line of inquiry led in 2006 to Dwork’s differential privacy [10,11]. The axiom here is that, given two datasets that differ by one record, if the two datasets are statistically indistinguishable, then anonymity is achieved. Simplifying here, the measure is a single parameter ϵ. The smaller the ϵ, the less statistical certainty, and the stronger the anonymity.

A key aspect of differential privacy is composability: any new information that is derived from the dataset adds to the value of ϵ. New information with less noise adds more to ϵ and vice versa. Ultimately, to ensure that a given target value of ϵ (the privacy budget) is not exceeded, a point is reached after which no new information can be derived (i.e. no more queries).

Differential privacy adds noise (typically Laplacian) to answers so that two conceptual databases that differ by one row are statistically indistinguishable from each other. The illustration shows a data flow with and without differential privacy approaches. People are asked whether they drink alcohol or not. Analyses are then derived from the survey results. If the data set is only pseudonymized, individual persons can still be identified on the basis of their age.
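
For readers unfamiliar with the mechanics, here is a minimal sketch of the Laplace mechanism combined with a privacy budget. The parameters are chosen only for illustration: each answer to a counting query (sensitivity 1) spends ϵ = 0.1 of a total budget of ϵ = 1, so after ten answers no further queries are allowed.

```python
import random

class DifferentiallyPrivateCounter:
    """Toy Laplace mechanism for counting queries (sensitivity 1) with a privacy budget."""

    def __init__(self, total_epsilon=1.0, epsilon_per_query=0.1):
        self.remaining = total_epsilon
        self.eps = epsilon_per_query

    def noisy_count(self, true_count):
        if self.remaining < self.eps:
            raise RuntimeError("privacy budget exhausted: no more queries allowed")
        self.remaining -= self.eps              # composition: every answer spends budget
        scale = 1.0 / self.eps                  # Laplace scale = sensitivity / epsilon
        # A Laplace sample is the difference of two i.i.d. exponential samples.
        noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
        return true_count + noise

dp = DifferentiallyPrivateCounter(total_epsilon=1.0, epsilon_per_query=0.1)
for i in range(12):
    try:
        print(i, round(dp.noisy_count(500), 1))
    except RuntimeError as err:
        print(i, err)                           # the 11th and 12th queries are refused
```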

At the risk of overlooking a substantial amount of research (and in the interests of keeping the brief history brief), I would say that K-anonymity and differential privacy are the dominant models in data anonymity research, and have been for over a decade.

Big in Research, Small in Practice

In spite of the dominance of K-anonymity and differential privacy in research, they have limited use in practice.

K-anonymity does not work well with high dimensional data (tables with many attribute columns). The more attributes, the more generalized the attribute values must be in order to generate groups of K indistinguishable users [3]. K-anonymity can be useful for data with very few columns (for instance only location and time). K-anonymity can also be used on a subset of columns, for instance those containing attributes likely to be publicly available. The resulting dataset, however, must still be regarded as pseudonymized.

Differential privacy also has poor usability. This statement may surprise some readers who have seen press announcements from Apple, Google, and Uber regarding their use of differential privacy. Deployments that honor composability and a value of ϵ that would generally be regarded as strongly anonymous (i.e. ϵ < 1) allow for only a small number of queries, maybe a few tens. Some mechanisms, for instance Google’s [12] and Apple’s [8], manage to avoid composability by making assumptions about the lack of correlation between attributes and for the same attribute over time. As a result, the number of queries they can make is unlimited, but the noise is high (errors in the hundreds or thousands for counts of users), and the ability to, for instance, observe correlations between attributes is lost. While these drawbacks are acceptable for the specific applications at Google and Apple, most analytic tasks require less noise and richer query semantics.

Differential privacy may also be usable in certain machine learning applications, where the goals of adding noise to improve privacy and lowering fidelity in the training set to avoid over-fitting are complementary [20].

How Can We Move Forward?

Given the current state-of-the-art in anonymization, practitioners work from the following spectrum of choices:

  • Simply pseudonymize the data.
  • Apply mechanisms from the “statistics toolkit” as needed to find a balance between the strength of anonymity and the quality of statistical outcomes.
  • Apply an axiom-based mechanism like K-anonymity or differential privacy.

The first choice has two big advantages: it is simple and there is little or no degradation in data quality. While it is not anonymous, it does provide substantially more protection than doing nothing. The GDPR encourages pseudonymizing when there is no other option. The last choice also has the advantage of simplicity, but the resulting data quality is generally not acceptable. The middle choice requires substantial expertise, effort, and care to get right. The three choices are given in order of best to worst analytic quality, and not surprisingly the order is also from most to least commonly used.

The goal of our research is to find a solution to the data anonymity problem that is easy for non-experts to configure and use but nevertheless is anonymous and works for a wide range of data types and analytic tasks.

This goal raises two big questions. First, what makes us think that we can make substantial new progress on this problem after nearly 50 years of active research? Second, how will we know whether we’ve succeeded? In particular, whether the analytics is useful enough and whether the system is anonymous enough.

Regarding the second question, we have from the beginning approached this problem entrepreneurially. In fact, it is fair to say that the most concise measure of success for us has been whether the solution is good enough that organizations will pay money for it — enough to sustain a business. In short, rather than decide as a mostly closed research community what the metrics of success are, we’ve chosen to let the users tell us in the most meaningful way.

A legitimate concern with this approach is that commercial users may not adequately care about the strength of the anonymity. In our experience this is partially true, but in practice the consensus of the privacy community, including Data Protection Authorities (DPAs) and researchers, is an important criterion for commercial users. Therefore we find that the entrepreneurial approach leads to a good measure of anonymity.

Regarding the first question, the answer early on was probably “irrational optimism”. Nevertheless after five years of sustained effort and several wrong turns, I believe we’ve finally had some conceptual breakthroughs, and can see our way towards more progress.

In retrospect, I think that the entrepreneurial approach has been the key to allowing us to find solutions. It has allowed us to shed requirements that are not critically important to users, but which would have weakened anonymity had we kept them. Foremost among the shed requirements is a formal guarantee of anonymity. Few if any users require one, and by not insisting on one we expand the solution space to include a variety of mechanisms that might otherwise be too complex for formal analysis.

The downside, however, is clear: we don’t know if our solution has unknown flaws. Or, almost certainly more correctly, we don’t know what the flaws of our system are. To mitigate this, we have designed a measure of anonymity and use the measure in an active bounty program that encourages others to find the flaws in our system. More generally, however, our design approach is one of initially allowing some given feature that we believe to be safe, later finding an attack that exploits the feature, and then either disabling the feature if it is not important or designing a defense against the attack if it is important. In the long run, we fail if the resulting set of features is not useful to users, and succeed if it is. I understand that in many respects this is not the ideal design approach. We’d love for there to be some anonymity axiom that allows for good usability, but no such axiom is known and I even wonder if one exists at all.

A good example of this process can be taken from the Dinur-Nissim paper [9]. This attack requires that the attacker be able to select sets of individuals in the dataset, and to know which individuals have been selected. The differentially private solution to this attack (and any other attack, known or unknown) is to add noise and limit queries. Another solution, though less general, is to prevent the attacker from selecting sets of individuals in the first place. Then the question is, is it important to analysts to be able to select sets of individuals? Given that the very nature of data anonymity is to expose aggregates but hide individuals, there is no expectation among analysts that they should be able to select individuals. This is therefore a case where we can prevent an attack with little impact on usability.

Diffix, Key Concepts

Our design is called Diffix. Diffix is developed as a joint research project of MPI-SWS and Aircloak which offers it as a commercial product.

The details of the design have evolved over the last couple years, both as new vulnerabilities are found and fixed, and as we add features, but the key concepts have persisted. In this section, we give a partial description of Diffix that nevertheless conveys the key intuitions. A complete description of Diffix, specifically the version used for the first round of the bounty program (Dec. 2017 – May 2018), is cited here [15]. The next section discusses the bounty program and our criteria for measuring anonymity.

The first key design decision was whether to do a whole-dataset anonymization as with K-anonymity, or to do query-by-query anonymization, as typically found with differential privacy. Early on, we made the decision to go with a query-by-query approach. I won’t say that this decision was made based on some kind of principled analysis. Rather, a whole-dataset approach seemed untenable to us. To be general, a whole-dataset approach would have to somehow defend against any possible query that might be made on the anonymized dataset.

Diffix is deployed between an unmodified database and the analyst or application that queries the data.

We couldn’t think of any way to deal with this. With query-by-query, there was at least the possibility that the system could adapt dynamically to whatever queries were being made. We perceived, correctly I believe, that this would give us more design options.

Diffix is a system that is deployed between an unmodified database and the analyst or application that queries the data (Figure 2). Diffix exposes an SQL interface to the analyst. Diffix internally models the database as a relational database with tables, columns, and rows, though the database does not need to be SQL per se. Diffix receives an SQL query from the analyst, interprets it and sometimes slightly modifies it, requests the appropriate data from the database, sometimes slightly modifies the data, computes the requested statistical function, adds noise, and conveys the answer to the analyst.

The decision to add noise to answers necessarily follows from the need to defend against attacks like the general tracker mentioned earlier.

From the beginning, however, it was clear to us that a limit on the number of queries that could be made (i.e. the differential privacy budget) was a non-starter.

Therefore we saw no choice but to base our solution on the principle of sticky noise suggested by Denning nearly 40 years ago [5].

In short, either we find a way to make sticky noise work, or we give up. Every subsequent design decision followed from this premise.

Recall that Denning tried to prevent an attacker from averaging away the noise by seeding the noise generating function using the contents of the entire query. The problem was that of defending against attacks that use different queries or groups of queries that nevertheless generate the same answer. For example the attack from earlier that used pairs of queries [gender=‘M’ AND age=30] and [gender=‘M’ AND age!=30], [gender=‘M’ AND age=31] and [gender=‘M’ AND age!=31], etc., to average away the random noise.

Diffix defends against this using what we call layers of sticky noise. For each condition in the query, Diffix adds a separate sticky noise sample. The noise is Gaussian and the noise samples are summed: queries with more conditions have more noise. A query with the conditions [gender=‘M’ AND age=30] would have two noise samples, one for each condition. In this attack, the noise layers for the [age=XX] conditions could indeed be averaged away, but the sticky noise from [gender=‘M’] persists and is not averaged away. As a result, the attacker would not know the true count of individuals, and therefore could not, for instance, run a tracker attack.

The seed itself is derived from the semantics of the condition. So for instance for the condition [age=36], the seed consists of a one-way hash of the symbols ‘age’, ‘36’, ‘=’, and a secret salt (a large hard-to-guess random value). Diffix allows more complex conditions, for instance [sqrt(age)=6], and must recognize the semantics of the conditions so that an attacker cannot cause what is effectively the same condition to generate different seeds. If Diffix cannot determine the semantics of the condition from inspection, then it does what we call floating the column. Diffix requests the column values (in this case age) from the database and uses the returned values themselves in the seed. In the case of [sqrt(age)=6], the database would return age values of 36, and the seed components would therefore again be ‘age’, ‘36’, ‘=’, and the salt.
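
A sketch of per-condition noise layers, using toy hashing and noise parameters rather than the exact seeding described in [15]. Each condition contributes its own sticky Gaussian sample; an attacker who varies the age condition averages away only the age layers, while the [gender=‘M’] layer persists as a fixed offset.

```python
import hashlib
import random

SALT = "secret-salt"        # stands in for Diffix's large, hard-to-guess salt

def layer_noise(column, op, value, sd=1.0):
    """One sticky noise layer: a Gaussian sample seeded from the condition's semantics."""
    seed = hashlib.sha256(f"{column}|{op}|{value}|{SALT}".encode()).hexdigest()
    return random.Random(seed).gauss(0, sd)

def noisy_count(true_count, conditions):
    """Add one sticky noise layer per condition to the true count."""
    return true_count + sum(layer_noise(*c) for c in conditions)

# The attack from the text: pairs [gender='M' AND age=X] / [gender='M' AND age!=X].
# Each pair sums to the full count of males (a toy 800 here). The age layers differ
# from pair to pair and average away, but every query carries the identical
# [gender='M'] layer, so the averaged result stays offset from the true count.
pair_sums = []
for age in range(30, 60):
    a = noisy_count(412, [("gender", "=", "M"), ("age", "=", age)])
    b = noisy_count(388, [("gender", "=", "M"), ("age", "!=", age)])
    pair_sums.append(a + b)

print("average over pairs: ", round(sum(pair_sums) / len(pair_sums), 2))
print("true count + offset:", round(800 + 2 * layer_noise("gender", "=", "M"), 2))
```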

When Diffix determines the amount of noise to add per layer (the standard deviation of the Gaussian sample) it needs to take into account how much individuals contribute to the answer. For instance, suppose that the analyst requests the sum of salaries. The higher salaries can dominate the sum, and so to hide the impact of the higher salaries, and therefore protect the corresponding individuals, Diffix adds an amount of noise proportional to the highest salaries. Diffix may also effectively modify extreme values to be similar to a set of less extreme values. This is done to prevent the amount of noise itself from signaling the presence or absence of an extreme value in the answer.

In order for Diffix to estimate how much any individual contributes to an answer, Diffix must be able to see how individual users influence the data. To do this, Diffix requires that each individual in the raw database has a single unique identifier (UID) associated with it. An administrator must configure Diffix to recognize which columns contain the UIDs. This is, however, the only required configuration. In particular, Diffix does not require knowledge of which columns may or may not be personally identifying, which columns are more sensitive, and so on.

A useful feature of any analytics system is the ability to discover unknown attribute values. For instance, the attribute ‘URL visited’ may contain many URLs that the analyst does not know about. Diffix cannot, however, simply list all attribute values. A URL, for instance, may contain personally identifiable and sensitive information. To allow for attribute discovery while defending against the release of individual information, Diffix has a feature called low-count filtering. Diffix does not reveal a value unless a threshold number of distinct individuals have that value. The threshold itself is a noisy threshold: a value taken from a Gaussian distribution using the layered sticky noise seeding mechanism.
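
A sketch of low-count filtering with a sticky noisy threshold, using hypothetical parameters (mean 4, standard deviation 1) and the same toy seeding as above. A value is revealed only if the number of distinct individuals holding it exceeds the noisy threshold for that value.

```python
import hashlib
import random

SALT = "secret-salt"

def noisy_threshold(column, value, mean=4.0, sd=1.0):
    """Sticky noisy threshold, seeded per value so that repetition cannot shift it."""
    seed = hashlib.sha256(f"{column}|{value}|{SALT}".encode()).hexdigest()
    return random.Random(seed).gauss(mean, sd)

def low_count_filter(column, value_to_uids):
    """Reveal a value only if enough *distinct* individuals have it."""
    return [value for value, uids in value_to_uids.items()
            if len(set(uids)) >= noisy_threshold(column, value)]

# Toy data: which distinct users visited which URL.
visits = {
    "https://example.com/popular": ["u1", "u2", "u3", "u4", "u5", "u6", "u7", "u8"],
    "https://example.com/rare":    ["u1", "u2"],
    "https://example.com/?user=u9&token=abc123": ["u9"],   # sensitive, held by one user
}

# The popular URL is revealed; the low-count URLs are almost certainly suppressed.
print(low_count_filter("url", visits))
```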

Diffix supports a substantial portion of SQL, including JOINs, nested sub-queries with aggregation functions (sum, count, max, etc.), a variety of math and string functions, casting, and condition operators such as LIKE, IN, and BETWEEN. Diffix can also work with all standard SQL data types: text, datetimes, ints and reals, and booleans. Nevertheless, Diffix also imposes substantial limitations on the SQL it accepts. The allowed SQL is certainly not Turing complete. For instance, in order to prevent an attacker from emulating conditional logic using only math, Diffix limits the complexity of math functions.

As another example of an SQL limitation, Diffix limits the granularity of inequalities (e.g. [age>=0 AND age<=58]). This is done to prevent an attacker from removing the noise associated with an inequality by incrementally changing the boundaries. For instance, suppose that the attacker wanted an exact count of individuals aged 30. He or she could make a set of queries with conditions [age>=29.9 AND age<=30.1], [age>=29.91 AND age<=30.11], [age>=29.92 AND age<=30.12] etc. Each condition has different semantics, and would produce a different noise seed, but selects the same individuals. To mitigate this, Diffix requires that inequalities must be ranges rather than open-sided, and that the ranges conform to a pre-defined set of sizes and offsets (e.g. …, 0.1, 0.2, 0.5, 1, 2, 5, …). We call this snapped alignment.
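
A sketch of snapped alignment, assuming for illustration that the allowed range sizes follow a 1-2-5 progression and that offsets are multiples of the range size. The incrementally shifted attack ranges all snap to one and the same aligned range, and therefore produce identical noise seeds; the sketch also reproduces the snapping of [weight>=91 AND weight<96] to [90, 95) used in the worked example later in the article.

```python
import math

def snap_size(requested_size):
    """Smallest allowed size (a hypothetical 1-2-5 progression) covering the request."""
    candidates = [m * 10 ** e for e in range(-3, 6) for m in (1, 2, 5)]
    return min(s for s in candidates if s >= requested_size - 1e-9)  # tolerance for floats

def snap_range(low, high):
    """Snap [low, high) to an allowed size, with the offset a multiple of that size."""
    size = snap_size(high - low)
    start = math.floor(low / size) * size
    return (round(start, 6), round(start + size, 6))

# The attack from the text: ranges shifted by 0.01 at a time all snap identically,
# so they all produce the same noise seed.
for shift in (0.0, 0.01, 0.02, 0.03):
    print(snap_range(29.9 + shift, 30.1 + shift))   # (29.8, 30.0) every time

# The snapping from the worked example later in the article:
print(snap_range(91, 96))                           # (90, 95)
```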

The basic layered sticky noise described so far is not the whole story. There is a class of attacks called difference attacks that exploit sticky layered noise and require still further mechanisms to defend against. A difference attack is one where two queries either comprise the same set of individuals or differ by exactly one individual (the victim). In certain scenarios, the attacker can detect which is the case based on the difference between the two answers.

As an example, suppose that the attacker knows that there is only one woman in the CS department. Two queries, one with the conditions [age=30 AND gender=‘M’], and another with the condition [age=30] will differ by one individual (the woman) if her age is 30, and will not differ otherwise. For this single pair of queries, the noise layer associated with [gender=‘M’] will probably cause the two answers to differ whether or not the woman is aged 30. If, however, the attacker makes a similar pair of queries for each age (31, 32, etc.), then the [gender=‘M’] noise layer will be the same for each such pair, and so the woman’s age can be detected as being in the pair whose difference is different from the other pairs.

To defend against this, Diffix adds a second content-based noise layer whose seed is derived not just from the condition components, but also from the number of distinct individuals that comprise the answer. The seed components for the content-based noise layer for [gender=‘M’] are ‘gender’, ‘=’, ‘M’, salt, and count(distinct UIDs). As a result, the content-based noise layer will differ for each instance of [gender=‘M’], and so each pair in the attack will differ by a different amount, making the attacker uncertain as to the woman’s age.
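
A sketch of the effect of the content-based layer, reusing the toy seeding from above. With only the query-semantics layers, the pair containing the woman’s age differs from every other pair by exactly one; once the count of distinct UIDs is folded into the seed, the differences vary from pair to pair and no single age stands out.

```python
import hashlib
import random

SALT = "secret-salt"

def layer(column, op, value, extra="", sd=1.0):
    """One sticky Gaussian layer; `extra` optionally folds in the distinct-UID count."""
    seed = hashlib.sha256(f"{column}|{op}|{value}|{extra}|{SALT}".encode()).hexdigest()
    return random.Random(seed).gauss(0, sd)

def noisy_count(uids, conditions, content_based=True):
    noise = 0.0
    for col, op, val in conditions:
        noise += layer(col, op, val)                            # query-semantics layer
        if content_based:
            noise += layer(col, op, val, extra=str(len(uids)))  # content-based layer
    return len(uids) + noise

# Toy department: everyone aged 30..34 is male except one woman, whose age is 32.
men_per_age = {30: 5, 31: 7, 32: 6, 33: 4, 34: 8}
woman_age = 32

for age, men in men_per_age.items():
    uids_a = [f"m{age}-{i}" for i in range(men)]                    # [age=X AND gender='M']
    uids_b = uids_a + (["the-woman"] if age == woman_age else [])   # [age=X]
    conds_a = [("age", "=", age), ("gender", "=", "M")]
    conds_b = [("age", "=", age)]
    static = (noisy_count(uids_b, conds_b, content_based=False)
              - noisy_count(uids_a, conds_a, content_based=False))
    layered = noisy_count(uids_b, conds_b) - noisy_count(uids_a, conds_a)
    print(age, round(static, 2), round(layered, 2))

# With only the static layers, the woman's age stands out: its difference is exactly 1
# larger than the identical differences for every other age. With the content-based
# layers added, the differences all vary and no single age stands out.
```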

There is another, more sophisticated variant of the difference attack attributed to Gadotti et al. [16]. Suppose that the attacker has a PsID (pseudo-identifier) for the victim, for instance [bday=‘1957-12-14’ AND zip=12345]. The attacker may then exploit the PsID to learn the value of another attribute, say party affiliation ([party=‘dem’]). To do this, the attacker generates a series of query pairs where, for each pair, one of the PsID attributes is removed or negated:

Pair 1:
[bday!=‘1957-12-14’ AND zip=12345 AND party=‘dem’]
[zip=12345 AND party=‘dem’]

Pair 2:
[bday=‘1957-12-14’ AND zip!=12345 AND party=‘dem’]
[bday=‘1957-12-14’ AND party=‘dem’]

Each pair differs by one individual if the victim’s party is democrat, and is the same otherwise. Given an increasing number of sample pairs, the attacker’s confidence as to whether the victim is present or absent increases. If the PsID has four or five attributes, the confidence can be quite high. As it so happens, finding a PsID with enough attributes where the individual queries are not low-count filtered is quite rare. This is fortunate since Gadotti et al. chose to publicly disclose the attack before informing Aircloak.

Nevertheless, the attack can be prevented through a mechanism that detects and drops low-effect conditions. A low-effect condition is one whose presence has a very small effect on the set of distinct individuals that comprise an answer. In this attack, the negating condition (i.e. [zip!=12345]) has either no effect or changes the answer by one individual. Either way, the effect is small and the condition is dropped from the query. As a result, both queries in the pair are identical and the attacker learns nothing.
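
A sketch of low-effect detection, under assumptions made here about the threshold (the real check in Diffix is noisier and more involved): a condition is dropped when removing it changes the set of distinct individuals matching the query by at most one.

```python
def distinct_uids(rows, conditions):
    """Distinct individuals matching all conditions; rows are dicts with a 'uid' key."""
    return {r["uid"] for r in rows if all(cond(r) for cond in conditions)}

def drop_low_effect(rows, conditions, max_effect=1):
    """Drop any condition whose removal changes the answer set by at most max_effect UIDs."""
    kept = []
    for i, cond in enumerate(conditions):
        without = conditions[:i] + conditions[i + 1:]
        effect = len(distinct_uids(rows, without) ^ distinct_uids(rows, conditions))
        if effect > max_effect:
            kept.append(cond)
    return kept

# Toy data: the victim plus other democrats, republicans, and out-of-zip democrats.
rows = (
    [{"uid": "victim", "bday": "1957-12-14", "zip": "12345", "party": "dem"}]
    + [{"uid": f"d{i}", "bday": f"19{60 + i}-01-01", "zip": "12345", "party": "dem"} for i in range(5)]
    + [{"uid": f"r{i}", "bday": f"19{60 + i}-01-01", "zip": "12345", "party": "rep"} for i in range(5)]
    + [{"uid": f"o{i}", "bday": f"19{60 + i}-01-01", "zip": "99999", "party": "dem"} for i in range(5)]
)

# The first query of Pair 1 from the text.
pair_1a = [lambda r: r["bday"] != "1957-12-14",
           lambda r: r["zip"] == "12345",
           lambda r: r["party"] == "dem"]

kept = drop_low_effect(rows, pair_1a)
print(len(kept))   # 2: the negated birthday condition is dropped, so the pair's two
                   # queries become identical and the attacker learns nothing
```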

An Example of Diffix

The figure below illustrates the mechanisms described in this section for a query that requests the sum of all salaries in a dataset, conditioned on [dept=‘CS’], [age!=1000], and [weight>=91 AND weight<96]. This is expressed as SQL in the figure.

In the first step, Diffix snaps the range to the pre-defined alignment. As a result, the range is modified from [weight>=91 AND weight<96] to [weight>=90 AND weight<95]. Next, Diffix checks the negative condition [age!=1000] to determine if it is low-effect. This condition is low-effect, and so it is removed from the query. The analyst can safely be informed that both of these modifications have taken place.

The analyst has requested the sum of salaries (SELECT sum(salary)), but Diffix must compute a noisy sum. Diffix must also determine the number of distinct UIDs in order to seed the noise layers as well as to compute the low-count threshold. In order to compute the amount of noise, Diffix needs to know how much users with the highest salaries contribute to the sum. To do this, Diffix asks the database to compute various statistics about the salary, namely the min, max, average, and standard deviation.

This query has four sticky noise layers, two for each of the conditions. The two content-based noise layers are seeded identically to their query-based counterparts, but with the addition of the number of distinct UIDs (#uids). A Pseudo-Random Number Generator (PRNG) is executed once for each of the four seeds to produce four noise values. These values are summed together to produce the composite noise.

Using the composite noise as the low-count noisy threshold, the count of distinct UIDs are used to determine if the answer should be suppressed. If not, then the salaries, salary statistics, and composite noise are used to compute a noisy sum which is returned to the analyst.

Evaluating Anonymity

One of the challenges we face as a startup is that of convincing others and ourselves that Diffix is indeed anonymous. Axiom-based anonymity schemes can provide a good measure: for instance, for differential privacy an ϵ below 1 is generally regarded as strongly anonymous.

The majority of practical anonymization deployments, including Diffix, are however not axiom-based and typically have no such measure.

There are to my knowledge no organizations that can provide a certification for anonymization. National DPAs are from time to time asked to approve some narrowly-scoped data anonymization (a particular dataset for a particular use case), but the process is ad hoc. The French national DPA, CNIL, which in my opinion is one of the best national DPAs, if not the best, offers opinions on anonymity. These opinions, however, do not state that a given scheme is anonymous. Rather, they only go so far as to state that they do not find a problem with a given scheme. This stems from the sensible recognition that claiming anonymity with certainty is very difficult.

If we accept that axiom-based anonymization is rarely practical, then we need to accept that there will be some uncertainty regarding the degree of anonymity, and we need to explore ways to minimize that uncertainty. For the foreseeable future, DPAs won’t be in a position to do this. Towards this end, we launched a bounty program for anonymity — the first of its kind. The first phase of the program [14] ran for six months and had a $15,000 prize budget, with a maximum prize of $5,000. We provide a variety of public datasets that may be attacked, or participants may provide their own. Attackers may choose to assume more or less prior knowledge about the data. In attacking Diffix, attackers may make as many queries as they wish. The prize amount is based on how much an attacker can learn about data in a dataset, and how much prior knowledge is needed.

After an attack, the attacker makes a claim — a true/false statement — about the data. The claim is either correct or not. The higher the percentage of correct claims, the better the attack. In particular, we are interested in the improvement the attacker can make over a statistical guess. For example, if the attacker is making claims about gender, 50% of individuals are female, and 80% of the attacker’s claims are correct, then the attacker has demonstrated a 60% improvement over a statistical guess. We measure how much prior knowledge the attacker requires as the ratio of data cells the attacker needs to know to data cells the attacker learns.
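
Both measures can be written down directly. The sketch below assumes the improvement is computed relative to the room left above the statistical baseline, which matches the 50%, 80%, and 60% figures in the text.

```python
def improvement_over_guess(claim_accuracy, baseline_accuracy):
    """Gain over a statistical guess, as a fraction of the room above the baseline."""
    return (claim_accuracy - baseline_accuracy) / (1 - baseline_accuracy)

def prior_knowledge_ratio(cells_known, cells_learned):
    """Prior knowledge needed per data cell learned (lower means a stronger attack)."""
    return cells_known / cells_learned

print(improvement_over_guess(0.80, 0.50))   # 0.6: the 60% improvement from the text
print(prior_knowledge_ratio(4, 1))          # e.g. 4 known cells needed to learn 1 cell
```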

The specific claims allowed by the bounty program are based on the three EU criteria for anonymity [1], namely singling-out, linkability, and inference. These criteria are meant as a qualitative guideline for DPAs and others to evaluate anonymity. To give an example, the EU describes singling-out as “the possibility to isolate some or all records which identify an individual in the dataset”. The corresponding true/false claim is of the form “there is a single individual in the dataset with the following attributes.” For instance, one individual with attributes [gender=‘F’,zip=12345,dob=‘17-04-1992’]. If there is exactly one individual with those attributes, the claim is correct. Otherwise it is incorrect. The corresponding claims for linkability and inference can be found in [14].

I wish to be clear here: a bounty program is not a substitute for a proven guarantee, and I don’t mean to imply that it is. It can find vulnerabilities, but it cannot prove the absence of vulnerabilities. Nevertheless, a bounty program can potentially provide some evidence as to the likelihood that unknown vulnerabilities exist. The more qualified people that seriously search for vulnerabilities, the less likely it is that unknown vulnerabilities exist, and the less likely that malicious individuals will find new vulnerabilities.

Because the bounty program requires that participants register, it is possible to learn, for instance through follow-up surveys, how much effort was spent looking for vulnerabilities and how qualified participants are. By contrast, in the standard research model of publish and cite, it is hard to know how much effort went into failed attempts at finding vulnerabilities since the failed attempts are not published.


In the case of our bounty program, 33 participants registered for an attack, but only two ultimately carried out attacks. We have yet to survey the other 31 to find out how much effort, if any, they put into thinking of attacks. The bounty program was a success insofar as new attacks were found. We fortunately know how to defend against the successful attacks, and were pleased that the full budget was not consumed. We expect to continue running the bounty program with future releases.

Conclusion and Next Steps

In this article, I have described why strong anonymization with good usability is so hard, argued that in order to make progress the research community needs to complement the current focus on formal approaches with empirical approaches, described our empirical- and even entrepreneurial-based research project, and presented the anonymization technology we have developed, Diffix.

In spite of the fact that Diffix is the basis for a commercial product, I regard Diffix as very much a work in progress. Given the usability of Diffix, I think we have achieved a remarkable strength of anonymization.

Nevertheless, continuous work, from as diverse a community of researchers and hackers as possible, is required to gain an understanding of and confidence in the anonymization properties of Diffix. The bounty program is one step in that direction, but it is already the case that the monetary reward does not really justify the effort involved: the attackers from the first round were more motivated by academic interest than by financial gain. With luck, academic focus on Diffix and other new empirical approaches will grow.

Although Diffix has exceptionally good usability compared to current anonymization techniques, it is still quite rudimentary when compared to non-anonymized analytics. Besides the SQL limitations already discussed, Diffix has only very basic statistics functions (count, sum, average, standard deviation, min, max, and median). While these are quite useful, it is clear that more statistical operations are required, as well as advanced analytics like machine learning.

We wish and expect to remove many of the existing SQL limitations. We are also actively working on expanding the types of analytics that Diffix supports. Unfortunately each such change modifies the anonymization properties and opens the possibility of new attacks. While I believe that we are getting better at predicting what kinds of techniques are safe, we can expect occasional setbacks as we improve usability and strengthen anonymity. Certainly for the immediate future, Diffix should be used only in controlled circumstances, with for instance proper access control and usage logging.

Once Diffix is better recognized by the research community, I would love to see progress on building a formal understanding of its anonymization properties. I don’t have a good sense of how feasible this is. In any event, the focus of our research will continue to be empirical in nature as we continue to improve usability. As a final thought, I’d like to convey a comment recently shared with me by a prominent privacy researcher:

“I am doubtful you guys are going to win this game without limiting the number and type of queries. Just from basic coding theory, it seems that every noisy answer is a kind of codeword, and if you give enough answers, it should be possible to decode for the underlying values.”

This seems to me to be a central question. It may be true in the extreme, but it may well be the case that relatively painless limits on the type of queries can make such a decoding impossible or at least computationally infeasible in the way that a strong crypto algorithm with a large key is computationally infeasible.

I would love to see the theoretical research community focus more on the issue of query-type limitations rather than almost exclusively on limiting the number of queries. Even if it does turn out that Diffix eventually runs up against a number-of-queries limit, if the limit is large enough it may not matter practically speaking.

What I certainly believe to be true, however, is that data anonymity will be a vibrant and exciting area of research for the foreseeable future.

References

[1] [n. d.]. Article 29 Data Protection Working Party Opinion 05/2014 on Anonymisation Techniques.
[2] [n. d.]. Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule . http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html
[3] Charu C. Aggarwal. 2005. On K-anonymity and the Curse of Dimensionality. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB ’05). VLDB Endowment, 901–909. http://dl.acm.org/citation.cfm?id=1083592.1083696
[4] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. 2005. Practical privacy: the SuLQ framework. In PODS. 128–138.
[5] Dorothy E Denning. 1980. Secure statistical databases with random sample queries. ACM Transactions on Database Systems (TODS) 5, 3 (1980), 291–315.
[6] Dorothy E. Denning. 1981. Restricting Queries that Might Lead to Compromise. In 1981 IEEE Symposium on Security and Privacy, Oakland, CA, USA, April 27-29, 1981. 33–40. https://doi.org/10.1109/SP.1981.10000
[7] Dorothy E. Denning and Peter J. Denning. 1979. The Tracker: A Threat to Statistical Database Security. ACM Trans. Database Syst. 4, 1 (March 1979), 76–96. https://doi.org/10.1145/320064.320069
[8] Apple Differential Privacy Team. [n. d.]. Learning with Privacy at Scale. https://machinelearning.apple.com/docs/learning-with-privacy-at-scale/appledifferentialprivacysystem.pdf.
[9] Irit Dinur and Kobbi Nissim. 2003. Revealing information while preserving privacy. In Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, 202–210.
[10] Cynthia Dwork. 2006. Differential Privacy. In ICALP.
[11] Cynthia Dwork. 2008. Differential Privacy: A Survey of Results. In TAMC. 1–19.
[12] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. 2014. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security. ACM, 1054–1067.
[13] Ivan P Fellegi. 1972. On the question of statistical confidentiality. J. Amer. Statist. Assoc. 67, 337 (1972), 7–18.
[14] Paul Francis. 2018. Procedures and Rules for Phase 1 of the Diffix Bounty Program. Technical Report MPI-SWS-2018-007. MPI-SWS.
[15] Paul Francis, Sebastian Probst-Eide, Pawel Obrok, Cristian Berneanu, Sasa Juric, and Reinhard Munz. 2018. Extended Diffix. arXiv:arXiv:1806.02075
[16] Andrea Gadotti, Florimond Houssiau, Luc Rocher, and Yves-Alexandre de Montjoye. 2018. When the signal is in the noise: The limits of Diffix’s sticky noise. CoRR abs/1804.06752 (2018). arXiv:1804.06752 http://arxiv.org/abs/1804.06752
[17] George Gatewood. 2001. A Monograph on Confidentiality and Privacy in the U.S. Census. https://www.census.gov/history/pdf/ConfidentialityMonograph.pdf.
[18] Gregory J. Matthews and Ofer Harel. 2011. Data confidentiality: A review of methods for statistical disclosure limitation and methods for assessing privacy. Statistics Surveys 5 (2011), 1–29. https://doi.org/10.1214/11-SS074
[19] Arvind Narayanan and Vitaly Shmatikov. 2008. Robust De-anonymization of Large Sparse Datasets. In Proceedings of the 2008 IEEE Symposium on Security and Privacy (SP ’08). IEEE Computer Society, Washington, DC, USA, 111–125. https://doi.org/10.1109/SP.2008.33
[20] Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Úlfar Erlingsson. 2018. Scalable Private Learning with PATE. CoRR abs/1802.08908 (2018). arXiv:1802.08908 http://arxiv.org/abs/1802.08908
[21] Latanya Sweeney. 2002. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10, 05 (2002), 557–570.
[22] European Union. 2009. What is personal data? https://ec.europa.eu/info/law/law-topic/data-protection/reform/what-personal-data_en.
