Building a Privacy-preserving Analytics Stack
This paper is a comprehensive guide to building a modern privacy-preserving analytics stack. It is aimed mainly at data analysts and senior managers who want to better understand how to comply with the requirements imposed by the GDPR while still getting real value from data analysis.
According to some commentators, data scientists spend up to 80% of their time on housekeeping tasks such as locating, organising, cleaning and de-duplicating data, and only 20% on actual analysis. While this 80/20 rule is based on observation rather than hard measurement, such ancillary tasks certainly can take a disproportionate share of a data analyst’s time. Add stringent data privacy requirements on top, and the problem only gets worse.
Analytics is a widely used term, and like many such terms it means different things to different people. In this paper we define it fairly broadly as the process of taking a large, often diverse, dataset and extracting valuable insights that can be used as part of your business intelligence and planning. (People often say “big data” when they talk about analytics; while the two are closely linked, the overwhelming majority of datasets don’t really justify the big data label.) The important point is that the analysis should generate valuable and useful insights.
This paper focuses on how to create the best privacy-preserving data analytics stack. That means one that will help increase productivity, reduce the time lost in data housekeeping while guaranteeing data privacy. Our aim is to give data analysts a firm understanding of some important concepts in data privacy and security, along with explaining how these interact with the actual analytics stack.
Data privacy has always been important, especially for companies dealing with individuals and the general public. You only have to look at significant data breaches, such as when hackers stole the details of almost 150 million Equifax customers, to see the enormous damage a company can suffer. That breach is estimated to have cost Equifax $60-75 million in lost profits, as well as impacting future sales. Since the General Data Protection Regulation (GDPR) came into force in May 2018, data privacy has assumed increased significance for any company that deals with customers who are EU citizens or residents. As a result, data privacy and security often end up taking a significant share of an analyst’s time.
In the course of this paper we will describe the constituent parts of an analytics stack, explain the requirements for data privacy and security, explore how anonymisation can help meet those requirements, and then look at how to choose the right tools for your analytics project. The overall aim is to show that with the correct analytics setup you can claw back some of your lost productivity and spend your time on actual data analysis rather than data housekeeping.
An Abstract Model for Data Analytics in the Modern World
Before explaining how to choose the best analytics tool stack, we first need to create an abstract model for the stack. This allows us to discuss the required functionality without being wedded to preconceived ideas about the capabilities and limitations of specific tools such as Postgres. The inspiration for the following is the well-known TCP/IP model, beloved of network engineers and computer science professors everywhere. That model uses the concept of horizontal abstractions, or layers, and vertical abstractions, or entities.
Table of Contents
An Abstract Model for Data Analytics in the Modern World
The Vertical Abstraction
The Horizontal Abstraction
Combining the Abstractions
Applying the Model
Other Data Abstractions and Standards
Data Origination and Data Exchange
Ensuring Data Security and Privacy
Why is analytics harder in a GDPR world?
Privacy by Design
The Power of Anonymisation
What is anonymisation?
Exploding some myths about anonymisation
How to use anonymised data in analysis
How do I choose the best tools for my project?
Data Storage Tools
Data Security and Privacy Tools
Data Analysis and Visualisation Tools
How Aircloak can help
How does Aircloak Insights work?
Diffix – the Technology behind Aircloak
The Vertical Abstraction
In a 2017 blog post on designing a modular analytics stack, David Wallace at Mode divides the task of data analysis into the following key elements: data collection, data consolidation, data warehousing, data transformation / processing and data analysis / BI. Each of these modules exists as a discrete entity with data flowing through as shown in the following diagram. Tristan Handy of Fishtown Analytics further refines this model into one consisting of just three main elements: Data loaders, data warehouses, and data consumers. This forms the basis for our model.
Modern businesses generate data constantly. Often this data is an unintentional by-product of the main business process. As a simple example, consider the geo-tagging of smartphone photos. The primary data in this case is the photo, but the geo-tag provides valuable extra data that can be mined using big data techniques – something the big tech companies didn’t fully appreciate at first. In other cases, the data is a core part of the business (for instance account details for a bank), but there is enormous untapped value in extracting and analysing it. So, the first element of any data analysis stack needs to collect relevant data from whatever sources are available, consolidate it and pass it on to the next part of the stack.
The massive growth in data, and in particular the advent of big data, has seen the emergence of a new category of data storage: the data warehouse. In this abstract model a data warehouse stores all the data collected by the data loader and is able to transform and deliver it in the required form to the data consumer. Here we use “data warehouse” to cover any entity used for storing data for later analysis. In some cases, this may be an actual data warehouse (e.g. a massive data store in a cloud-connected data centre), but in other cases it may simply be a storage server on your own premises.
The final piece of this model is the data consumer. There are many types of data consumers, but essentially most of them seek to take the data passed to them and extract useful business intelligence or insights from it. Some data consumers are automated dashboards, others might be a skilled data analyst. Or they may be a senior executive trying to get some summary figures to show how the business is growing and performing. A key thing to note is that not all data consumers are internal to your organisation. They may be a contractor or a collaborator, or they may be a 3rd party organisation buying your data.
The Horizontal Abstraction
The vertical abstraction just described identifies the discrete elements or entities within the system, but it doesn’t identify the required functionality. For that we need to look at a horizontal abstraction where we divide the problem into layers. Each layer represents a functional abstraction. As you will see, the entities in the previous section are composed of different sets of these underlying functions.
Locating raw data and ingesting it is the fundamental function of an analytics stack. Often this process will require ingesting large streams of data which may even be generated in real-time (for instance data coming from fitness trackers). One of the most important aspects here is data provenance. You have to know where the data came from and if there were any restrictions attached to using the data. If there were existing restrictions, then these need to be recorded as part of the metadata for this data.
Data Extraction, Transformation and Loading (ETL)
ETL is the process of taking the raw data found during the ingestion stage, extracting the meaningful/useful data from it, transforming it to a suitable form for your chosen storage medium (e.g. key-value store, blob store, relational database, etc.) and finally loading the data into the storage. In the past, one of the key jobs of ETL was to reduce the data down to a manageable size for storage. This often meant being quite aggressive with the extraction stage. Nowadays, the significantly reduced cost of storage often means it’s better to try and store as much of the original data as possible. In the future, this data may well prove to be valuable to you, and if you haven’t stored it you can never recover it.
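To make the process concrete, the following is a minimal sketch of an ETL pipeline in Python, using an in-memory SQLite database as the storage medium. The input format, field names and table layout are purely illustrative. Note how the transform step keeps the full original record as JSON, following the advice above to store as much of the original data as possible.

```python
import csv
import json
import sqlite3

def extract(csv_text):
    """Extract: parse raw CSV rows into dictionaries."""
    return list(csv.DictReader(csv_text.splitlines()))

def transform(rows):
    """Transform: normalise types, but keep the complete original
    record as JSON so nothing is lost (storage is cheap, recovery is not)."""
    out = []
    for row in rows:
        out.append({
            "customer_id": int(row["customer_id"]),
            "amount": float(row["amount"]),
            "raw": json.dumps(row),  # everything we ingested, verbatim
        })
    return out

def load(records, conn):
    """Load: write the transformed records into the storage medium."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer_id INTEGER, amount REAL, raw TEXT)"
    )
    conn.executemany(
        "INSERT INTO sales VALUES (:customer_id, :amount, :raw)", records
    )

raw = "customer_id,amount,region\n1,19.99,EU\n2,5.00,US\n"
conn = sqlite3.connect(":memory:")
load(transform(extract(raw)), conn)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

A real pipeline would of course read from files or streams rather than a literal string, but the extract/transform/load separation stays the same.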
The data cleaning function covers several aspects including finding and removing corrupt data, finding and removing duplicates and removing irrelevant data. Where data has been generated as part of a large stream this process may be quite extensive, and depending on the source of the data, it can be hard. As an example, if the data consists of medical notes, you may need to use some form of OCR (optical character recognition) along with Natural Language Processing (NLP) to extract the underlying meaning in the data. As data science develops as a discipline, new cleaning functions are emerging, such as identifying duplicate entities that exist across multiple datasets.
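A cleaning pass along these lines might look like the following sketch. The records, and the rules for what counts as corrupt or duplicate, are illustrative assumptions; real cleaning logic is highly dataset-specific.

```python
def clean(records):
    """Remove corrupt entries and duplicates from a list of records."""
    seen = set()
    cleaned = []
    for rec in records:
        # Drop corrupt rows: here, anything missing a required field.
        if not rec.get("email") or rec.get("age") is None:
            continue
        # Drop duplicates, keyed on a normalised identifier.
        key = rec["email"].strip().lower()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({**rec, "email": key})
    return cleaned

records = [
    {"email": "Ada@example.com", "age": 36},
    {"email": "ada@example.com ", "age": 36},  # duplicate after normalising
    {"email": None, "age": 22},                # corrupt: no email
    {"email": "bob@example.com", "age": 41},
]
cleaned = clean(records)  # leaves two records: ada and bob
```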
Security and Privacy
Technically, data security and data privacy are separate concepts, but within this abstraction they happen in the same layer. Data security refers to the process of ensuring only authorised people are able to access and interact with the data while data privacy refers to the process of ensuring that any data that is revealed outside your security perimeter doesn’t contain personal identifiers, except where such data sharing is allowed.
Performing the actual analysis of the data to extract useful insights and information is the main point of data analytics. This function covers a whole range of tasks from query generation to constructing complex algorithms and using machine learning to process data.
The highest layer in this abstraction is data presentation. Here the results of the analysis are presented in a usable form through visualisations or other techniques. This is where the data can finally be used for business intelligence, planning or review. Some presentation tasks may relate to metadata, for instance visualising the size of a database or illustrating the nature of the data being stored.
Combining the Abstractions
To create a full model of the analytics stack we need to combine the horizontal and vertical abstractions. The following image illustrates how these elements work together.
In the figure on the right hand side, functions can either form a key part of an entity, or they can be a subsidiary function. What you will notice is that Security and Privacy is the only function that exists across all the entities. This is because at every stage you have to ensure you are keeping your data secure and are preventing personal data from being leaked.
Applying the Model
The abstract model above is deliberately vague about how it relates to real tools. In this section we will see a couple of examples that show how the model can be used to describe the functionality of standard tools.
Databases are one of the core components of the data warehouse entity. Once, companies may have relied on monolithic relational databases such as Oracle for handling all their data. Nowadays there are a plethora of options, and often several different databases may be used in combination to achieve the required storage. Functionally, databases exist mainly at the ETL layer, but they also may include ingestion, security & privacy and some presentation functions (relating to the metadata about the actual database).
Visualisation is one of the main tools in any data analyst’s toolbox. There are many visualisation packages and approaches out there, from constructing graphs using Excel or R to displaying live dashboards using applications like Tableau. Clearly visualisation is a Data Consumer within the definition above. Equally clearly the main functionality is at the Presentation Layer. However, it also includes significant elements of Data Privacy (since it is essential to make sure only appropriate data is being displayed), Analysis (often the analysis is done on the fly within the tool) and even Cleaning (since sometimes it’s only after visualisation that you can identify superfluous or duplicate data).
Excursus: Other Data Abstractions and Standards
There are numerous other abstractions and data standards that have been developed over the years. Often these have been created for specific industries or use cases and are narrowly defined. In the modern world, larger amounts of personal data are being gathered than ever before. While all the data can be generically described as “personal”, the actual uses for the data vary widely. Equally, some of the data is more sensitive than other data. As an example, detailed medical records are often far more sensitive than the history of what online purchases you made.
All these different use cases and types of personal data need different models and architectures for storage, use and analysis. This is especially the case for the most sensitive data like health records and financial records. Below we give some examples of alternative abstractions.
Information models are used to describe data in an abstract fashion. The model typically explains what the data is and the semantics of how to interpret it, how the different elements relate to each other and any constraints and rules that should be applied. The aim of the model is to provide a rigorous framework within which to describe how data is used in a given domain. Information models are used to map real-world problems into an abstract space in order to simplify the process of creating software. A good example of this might be a process model for a production line.
Information models are often highly complex and abstract. A number of specific languages have been devised to try to explain or illustrate them, including IDEF1X and EXPRESS. One of the most famous of these is the Unified Modelling Language (UML) which provides a set of diagram types to describe how data and artefacts flow through a software program. These diagrams include Structure Diagrams such as Class Diagrams which describe the data and Behaviour Diagrams such as Use Case Diagrams or State Machine Diagrams that describe what the system should achieve with the data and how different elements within the system interact with each other.
Models for clinical data
CDISC, the Clinical Data Interchange Standards Consortium, is a global non-profit organisation dedicated to producing data standards relating to clinical research. They have a number of data standards such as ADaM (the Analysis Data Model), which defines the dataset and metadata used for analysis of clinical trials.
PhUSE (the Pharmaceutical Users Software Exchange) is a non-profit organisation set up by data scientists within the clinical and pharmaceutical industries. Among other working groups they have a group looking at data standards for use in this field. While PhUSE is not an official standards body, their work directly influences bodies such as CDISC and the US Food and Drug Administration (FDA) who create these standards.
Models for financial data
The EU second Payments Services Directive (PSD2) has driven a demand for new models of financial data. One of the key aims of PSD2 has been to open up competition in the banking market. In order to do this, there is a need for rigorous models for financial data. These models are designed to allow data to be exchanged securely and easily between financial institutions. It has also become one of the drivers for the booming fintech industry across Europe.
One of the most successful data models has come out of the Open Banking initiative, which has defined standards and APIs for open banking that can then be used to build commercial fintech products. In the context of this paper, the most relevant standard is the Account Information API, which provides a way to safely access and share customer account data.
Data Origination and Data Exchange
It’s useful to briefly digress here and consider two things that are closely related to data analysis. These are data origination and data exchange. Neither data origination, nor data exchange are directly part of the analytics stack, but they are key elements that need to be considered in any model of the stack.
In almost all systems, data isn’t static. New data is constantly being generated (data origination). Let’s take the simple example of a patient’s health record. Every time a patient visits their doctor, their record will be updated with notes about the consultation, the results of any tests and observations and any medicines that have been prescribed. Some of this data forms part of their permanent record, but other data expires. To make it more complicated, some data such as prescribed medicine is only valid in the short term, but the fact they have had that medicine should form part of the long-term record.
Now, when the patient attends the pharmacy to collect their medicine, the pharmacist may need to access their patient record to check the details of the prescription. Later on, the patient’s health insurer will also need to check the details of the transaction. These are examples of data exchange. Another increasingly common example of data exchange is fintech apps that allow you to aggregate all your bank account data into one place. Like data analysis, data security is essential during data exchange, and the receiving party will need to be responsible for the privacy of the data.
Ensuring Data Security and Privacy
One of the main focuses of this paper is data privacy. However, if you are collecting personal data then you can’t achieve data privacy without data security. In this section we will explore the Data Security and Privacy function in detail. We will explain in detail what we mean by each term and explore some of the techniques used.
Data Security refers to the process of keeping your data safe from unauthorised access. By that we mean access by people or machines that have no reason to access the data, even if they may be technically able to do so (for instance because they have access to your internal network). Frequently, someone may be in a position where they can view personal data, but if they have no good reason to view it then they shouldn’t be accessing it. There are a number of ways to secure your data including access controls, data encryption and physical location. We will now look at each of these in turn.
The primary form of security for most data is access control. Access control is the process of ensuring only authorised people and machines can physically download or view the data. The first step is authentication. Authentication means checking that someone is who they claim to be. Typically, this is done using password protection or, more securely, some form of two-factor authentication. Next you need to check that person is authorised to access the data. This may require a secure list of authorised users and may also add checks such as IP address filtering to ensure they are accessing the data from a known location. Finally, you really should record all data accesses. This will allow you to trace any unauthorised access, or any occasions where an authorised person inappropriately accessed the data. Often this process of verifying a person’s identity is only done periodically, for instance, when they first log into the system. After that they will generally be given some sort of token such as an API key that will allow the system to know they are who they say.
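The three steps just described – authentication, authorisation and audit logging – could be sketched as follows. All names and the in-memory stores are hypothetical, and a real system would use a dedicated password-hashing scheme (such as bcrypt or Argon2) and persistent storage rather than plain SHA-256 and dictionaries.

```python
import hashlib
import secrets

PASSWORDS = {"alice": hashlib.sha256(b"s3cret").hexdigest()}
AUTHORISED_USERS = {"alice"}   # who may read the data
SESSIONS = {}                  # issued token -> user
AUDIT_LOG = []                 # every access attempt is recorded

def authenticate(user, password):
    """Step 1: verify identity, then hand back a session token."""
    if PASSWORDS.get(user) != hashlib.sha256(password.encode()).hexdigest():
        return None
    token = secrets.token_hex(16)
    SESSIONS[token] = user
    return token

def read_record(token, record_id):
    """Steps 2 and 3: check authorisation and log the attempt either way."""
    user = SESSIONS.get(token)
    allowed = user in AUTHORISED_USERS
    AUDIT_LOG.append({"user": user, "record": record_id, "allowed": allowed})
    return f"data-for-{record_id}" if allowed else None

token = authenticate("alice", "s3cret")
result = read_record(token, 42)        # allowed, and logged
denied = read_record("bad-token", 42)  # denied, but still logged
```

Note that the denied attempt is logged too: the audit trail must capture unauthorised as well as authorised access.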
The next form of data security is encryption. This encryption may happen in the actual storage or it might happen while the data is in transit. If the storage itself is encrypted, then this will protect against physical theft of hard drives and other direct attacks. If the data is encrypted in transit (for instance by using a protocol such as Transport Layer Security or HTTPS), then this will protect it from interception attacks. In both cases, the encryption is only secure so long as the keys remain secure. Often the process of access control goes hand in hand with key security.
As mentioned above, one of the reasons to encrypt data is to protect it against physical attacks. Physical security is a key element of data security. Due to the volumes of data and processing required, most companies nowadays store their data in remote locations, typically large data centres. These data centres operate extremely high levels of physical security. Only authorised personnel are allowed to access the data centre. Racks are generally locked, with keys only released to authorised people. Groups of racks (known as pods) may be further secured by being enclosed in a cage. The idea is to prevent people from stealing the hardware or connecting unauthorised machines to the servers to access the data. While this security exists partly because of the intrinsic value of the hardware, it is also a key element of data security.
In this section we give a broad overview of what is meant by data privacy, what data is covered (and what data isn’t) and how this relates to our abstract model. Within Europe, data privacy is enshrined in law by the General Data Protection Regulation (GDPR). We will discuss the impact of GDPR later in the paper.
What do we mean by data privacy?
In the context of this paper, data privacy refers to the process of protecting sensitive personal data. By this we mean preventing that data from being released without the informed consent of the individual, except where other legal obligations require its release. For many organisations, this data forms an integral part of their customer data. Often, they need to be able to access this data to perform their legitimate business functions. However, they should not access it for other purposes or share it without permission.
What data is covered by data privacy?
Generally, data privacy is interested in personal data – in other words it specifically relates to an individual. The GDPR states “‘personal data’ means any information relating to an identified or identifiable natural person”. This is quite a broad definition and covers a wide range of data. The following is a far from exhaustive list of the things that are covered: name, gender, sexual orientation, disability, address, phone number, email address, physical location, identification numbers, employment details, phone records, usernames, social media handles and passwords. Data privacy techniques can also extend to protecting companies and other entities, for instance protecting confidential or privileged information, trade secrets, etc.
The simple test to decide if something is covered is whether knowing that data might make it easier to identify the individual or entity involved. The problem is that whether the data makes a person more identifiable may depend on the circumstances. This means there are grey areas such as knowing an individual’s employer – if they work for a large company then knowing this information won’t help identify them, but if they work for a company with only 2 or 3 employees then it clearly does help identify them.
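This grey-area test can be made mechanical by counting how many individuals share each value of a quasi-identifier: values held by only a handful of people are the risky ones. The records and the threshold `k` below are illustrative assumptions.

```python
from collections import Counter

def risky_values(records, field, k=3):
    """Return the values of `field` shared by fewer than k individuals.
    Small groups make their members easy to identify."""
    counts = Counter(rec[field] for rec in records)
    return {value for value, n in counts.items() if n < k}

records = [
    {"name": "A", "employer": "MegaCorp"},
    {"name": "B", "employer": "MegaCorp"},
    {"name": "C", "employer": "MegaCorp"},
    {"name": "D", "employer": "Tiny GmbH"},  # a one-person group: identifying
]
risky = risky_values(records, "employer")
```

This is essentially the intuition behind k-anonymity: every combination of quasi-identifiers should be shared by at least k individuals.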
What data is not covered?
As already mentioned, whether or not data is covered comes down to whether it can be used to identify the individual or entity involved. This often depends on context as in the case of someone’s employer mentioned above. The interesting thing is that even apparently obvious personal identifiers may not be covered if they don’t make it possible to identify someone. The classic example here is someone’s name. Within the UK there are about half a million people with the surname ‘Smith’, so knowing someone is British and has that surname is not a personal identifier. Other things may not be covered as personal data but might well be included under the requirements for data privacy. This includes certain transactions such as purchasing history or bank balance (of itself, knowing there is an individual with a current balance of €4,500 is not going to allow you to identify that individual, though combining that with other data may allow you to identify them).
Another major exemption from data privacy is when an individual has explicitly allowed the data to be shared. For instance, a customer might have allowed you to share their email address with another company, or may have left a public review on your website that reveals their username. The important thing is that such consent must be properly informed – it can’t be hidden in the small print or rely on a pre-ticked checkbox – and the fact that consent was given should be stored as part of the data.
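One way to honour this requirement is to store each consent, with a timestamp, alongside the data it relates to, and to check it before any sharing. The field and purpose names in the sketch below are illustrative.

```python
import datetime

def record_consent(record, purpose):
    """Attach an explicit, timestamped consent to a customer record."""
    record.setdefault("consents", {})[purpose] = {
        "given": True,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    return record

def may_share(record, purpose):
    """Sharing is only allowed if explicit consent was recorded."""
    consent = record.get("consents", {}).get(purpose)
    return bool(consent and consent["given"])

customer = {"email": "ada@example.com"}
record_consent(customer, "share_email_with_partner")
ok = may_share(customer, "share_email_with_partner")  # consent on record
not_ok = may_share(customer, "marketing_sms")         # no consent recorded
```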
How does this affect the abstract model?
The two key things with data privacy are to know when a piece of data is covered and to keep records of all permissions given relating to the data. This means that data privacy has to be considered in all three entities in the model. Data Loaders need to know whether they have the rights to a piece of data (e.g. they can’t necessarily scrape data relating to an individual and assume they are allowed to use it). Where they do have the right to use the data, they need to know if the owner of the data has given consent for it to be shared. The Data Warehouse needs to store the data along with all the consents that have been given. Finally, the Data Consumer must be sure that any data that is being released doesn’t breach data privacy.
Why is analytics harder in a GDPR world?
As we mentioned in the introduction, GDPR came into force across the EU in May 2018. It is one of the most far-reaching data protection laws anywhere in the world, and as such it has had a huge impact globally. This is because, unlike many national data protection laws, GDPR applies to any company that deals with EU residents, wherever that company is in the world. In this section we will look at the specific impact GDPR has had on analytics.
We begin with a brief refresher on what data GDPR covers and why you need to comply.
How does GDPR define personal data?
As already stated, GDPR covers any data that can be used to identify a person. It says that:
“An identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.”
The complexity here lies in the fact that it covers data that might indirectly identify a person. Often a single piece of data on its own will not be sufficient to identify a person (for instance the John Smith example we already met). However, if you also know that John Smith lives in Edinburgh and works for Starbucks then you may have narrowed it down enough to identify him.
How does GDPR define consent?
GDPR requires that individuals must have freely given informed consent before their personal data can be used. This means that not only must they give the consent, they must understand precisely what they are consenting to and must take some explicit action to indicate this. Importantly, this consent can also be withdrawn at any time.
What penalties can GDPR impose?
GDPR carries some of the toughest penalties in the world for any company that breaches the rules. The potential fines are 20 million Euros or 4% of annual global turnover, whichever is higher. The EU will also have the power to ban non-compliant organisations from trading with any nation that has adopted the GDPR into national law.
What other rights does GDPR give to individuals?
A key plank of GDPR is that it grants individuals certain additional rights relating to their personal data. Among others, these include:
- The right to be forgotten (meaning they can ask for all their data to be deleted)
- The right to request a copy of all their data
- The right to withdraw consent for their data to be used at any time
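These rights can be sketched against a simple in-memory store. The store layout and function names below are hypothetical illustrations; a real implementation would also have to propagate deletions to backups and downstream systems.

```python
import json

STORE = {
    "u1": {"name": "Ada", "consents": {"analytics": True}},
    "u2": {"name": "Bob", "consents": {"analytics": True}},
}

def export_data(user_id):
    """Right to a copy: hand back everything held on the individual."""
    return json.dumps(STORE[user_id])

def withdraw_consent(user_id, purpose):
    """Right to withdraw consent at any time."""
    STORE[user_id]["consents"][purpose] = False

def forget(user_id):
    """Right to be forgotten: delete all the individual's data."""
    STORE.pop(user_id, None)

copy = export_data("u1")
withdraw_consent("u1", "analytics")
forget("u2")
```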
What else is covered by GDPR?
GDPR also imposes several other requirements on companies. One of these is that they must take suitable care to ensure that all data is secured. Specifically, a company must “implement appropriate technical and organisational measures to ensure a level of security appropriate to the risk.” This requirement is purposely described in a flexible manner to take account of the specific circumstances of a company. For instance, a boutique coffee shop who keeps a customer mailing list would not be expected to put in place such stringent security measures as a bank.
Another important requirement is the need to promptly notify the authorities and any affected individuals in the event of any security breach. GDPR defines ‘promptly’ as being within 72 hours of a breach being identified. Companies must also have a plan in place to deal with potential breaches (a so-called Incident Response process). A key part of this is ensuring that all staff have appropriate training and that management in particular know exactly what should be done if they are informed of a potential data breach.
GDPR is not the only data protection legislation you need to be aware of. Many companies might find themselves covered by the US Health Insurance Portability and Accountability Act (HIPAA). Like GDPR, HIPAA has rules covering privacy, security, breach notification and compliance. Unlike GDPR, HIPAA carries both civil penalties (fines) and criminal penalties (with a potential maximum sentence of up to 10 years in prison). However, unlike GDPR, HIPAA only relates to data about health care and health insurance.
While GDPR applies across the EU, its requirements are a minimum standard. In certain cases more stringent rules may be applied by member countries or states. For instance, within Germany, certain states such as Brandenburg have far stricter rules relating to health records.
Privacy by Design
One of the greatest impacts of GDPR is it means companies need to adopt the principle of privacy by design. This directly impacts how a modern analytics stack needs to be built. For instance, the requirement that consent can be withdrawn at any time means that you can’t assume that because a given analysis didn’t breach GDPR in the past that will always be true. Consent may have been withdrawn in the meantime, and so this needs to be explicitly checked each time you run the analysis.
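In practice this means that, rather than caching the result of an earlier consent check, an analysis should filter on current consent every time it runs. The rows and field names in this sketch are illustrative.

```python
ROWS = [
    {"user": "u1", "spend": 120.0, "consent_analytics": True},
    {"user": "u2", "spend": 80.0,  "consent_analytics": True},
    {"user": "u3", "spend": 200.0, "consent_analytics": False},  # withdrawn
]

def average_spend(rows):
    """Compute the average, checking consent afresh on every run."""
    usable = [r["spend"] for r in rows if r["consent_analytics"]]
    return sum(usable) / len(usable) if usable else None

avg = average_spend(ROWS)  # u3's withdrawn consent excludes their row
```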
It also pays companies to design a system that allows an individual to easily access their own data and to give or withdraw their consent. Clearly this has big implications for the Data Security function since it is vital to ensure that it really is the correct individual who is requesting the data.
What if my existing system isn’t compliant?
Of course, many companies already have existing analytics stacks. These companies face a choice: retrofit privacy-preserving techniques onto the existing stack, or replace it entirely. Which option is best will depend on a number of factors, including the scale of your system, the volume of data you hold, whether your data includes large amounts of personal data and whether you expect to need to use that data.
Later in this paper we will explain how anonymisation can help you achieve GDPR compliance without the need to completely replace your analytics stack. We will also show why pseudonymisation and other simple approaches to anonymisation may not suffice.
The Power of Anonymisation
Since GDPR only relates to personal data, any data that is not personal is not covered by the regulation. This means that if you are able to completely remove any personal identifiers from the data, that data is no longer subject to the rules. This is where anonymisation comes in.
What is anonymisation?
Data anonymisation is the process of taking the personal data and modifying it in such a way that it can no longer be used to identify an individual. As a result, data that has been properly anonymised can be freely used, shared and transferred without being covered by GDPR or any other legislation.
However, true data anonymisation is difficult to achieve, which means many data controllers fail to do it properly and completely. Furthermore, proper anonymisation can destroy the data to such an extent that it no longer has any value. This is due to the requirement to ensure that you aren’t revealing data that can be used in combination with other details to identify an individual. Understanding how inference can be used to identify individuals from data is difficult and hence many data controllers take a conservative approach and assume almost all the data will need to be changed.
Pseudonymisation vs. anonymisation
Because anonymisation is so hard, people often try to use pseudonymisation to protect personal data. Pseudonymisation is the process of directly replacing personal identifiers with some form of random identifier. For instance, you might replace names with random sequences of letters and phone numbers with random digits. The trouble is that although this may give the illusion of protecting personal data, it doesn't give strong enough protection against inference attacks.
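A minimal sketch of pseudonymisation as just described, using Python's standard library. The record layout is invented for illustration. Note that the controller keeps a mapping from tokens back to the original values, which is precisely why pseudonymised data remains personal data under GDPR.

```python
import secrets

def pseudonymise(records, fields=("name", "phone")):
    """Replace direct identifiers with random tokens, keeping the
    token mapping so re-identification remains possible for the
    controller -- the defining property of pseudonymisation."""
    mapping = {}
    out = []
    for rec in records:
        rec = dict(rec)  # don't mutate the caller's data
        for f in fields:
            original = rec.get(f)
            if original is not None:
                token = mapping.setdefault(original, secrets.token_hex(8))
                rec[f] = token
        out.append(rec)
    return out, mapping

records = [{"name": "Alice", "phone": "555-0101", "age": 34}]
pseudo, mapping = pseudonymise(records)
```

Everything except the named fields (here, the age) passes through untouched, which is exactly what makes the result vulnerable to the inference attacks described next.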
Inference attacks work by combining data points in order to re-identify an individual with high probability. As an example, take a dataset containing health records. You might pseudonymise the data by replacing all the names and healthcare IDs with unique random numbers. However, an attacker may be able to use a combination of other data such as location, medical history, gender and age to make an accurate guess as to the real identity of the person.
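A toy version of such a linkage attack. The pseudonymised health records keep quasi-identifiers (postcode, gender, birth year), which an attacker joins against an assumed public dataset such as a voter roll; all records here are invented.

```python
# "Anonymised" health data: names removed, quasi-identifiers retained.
health = [
    {"pid": "a1f3", "postcode": "10115", "gender": "F",
     "birth_year": 1980, "diagnosis": "asthma"},
]
# Hypothetical public dataset the attacker already holds.
public = [
    {"name": "Erika M.", "postcode": "10115", "gender": "F", "birth_year": 1980},
    {"name": "Hans S.",  "postcode": "20095", "gender": "M", "birth_year": 1975},
]

def link(health_rows, public_rows):
    """Join on quasi-identifiers; a unique match re-identifies the person."""
    keys = ("postcode", "gender", "birth_year")
    matches = {}
    for h in health_rows:
        candidates = [p["name"] for p in public_rows
                      if all(p[k] == h[k] for k in keys)]
        if len(candidates) == 1:  # exactly one candidate => re-identified
            matches[h["pid"]] = candidates[0]
    return matches

reidentified = link(health, public)
```

Even though no name appears in the health data, the combination of three innocuous-looking fields is enough to attach a diagnosis to a named individual.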
Why pseudonymisation isn't enough
GDPR Recital 26 makes it clear that pseudonymisation is not strong enough to protect personal data. Pseudonymisation permits data controllers to handle their data more liberally, but it does not eliminate the risk of re-identification. Recital 26 states that “The principles of data protection should apply to any information concerning an identified or identifiable natural person. Personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person.” Crucially, this means pseudonymous data is still subject to privacy regulations under GDPR.
Recital 26 then distinguishes explicitly between pseudonymised and anonymised data. It explains that pseudonymised data “should be considered to be information on an identifiable natural person.” It then states: “The principles of data protection should […] not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.” It makes it clear: “This Regulation does not therefore concern the processing of such anonymous information, including for statistical or research purposes.” In short, if your data is anonymised it is no longer subject to the GDPR, while if it is simply pseudonymised GDPR still applies.
Anonymisation can generally take one of two approaches. In the first approach, you completely anonymise all the data before you query it. This approach is the traditional approach adopted by many companies. The other approach is dynamic (or interactive) anonymisation. Here the data is only anonymised as part of the query process. This has the benefit that the analysis will be more accurate/useful, but it is much harder to achieve.
Traditional approaches include k-anonymity. The concept of k-anonymity provides a way to measure the anonymity of data: k is defined such that each person cannot be distinguished from at least k-1 others. In other words, in a k=5 anonymised dataset, it is impossible to distinguish any person from at least 4 others. This is achieved by generalising the data, for example replacing exact values with ranges, until the required level of anonymity is reached. The trouble is, producing optimally k-anonymised datasets is known to be an NP-hard problem. Moreover, it doesn't even guarantee anonymity, since the data can still be cross-referenced with other publicly available datasets to enable re-identification. In other words, it is fundamentally flawed as an approach.
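The k of a dataset is easy to compute once you pick the quasi-identifiers: it is the size of the smallest group of rows sharing the same quasi-identifier values. A small sketch, with invented records, showing how one generalisation step (binning ages into decades) raises k:

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return k: the size of the smallest group of rows that share
    identical values for all quasi-identifiers."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(groups.values())

def generalise_age(rows):
    # One generalisation step: replace an exact age with its decade range.
    return [{**r, "age": f"{(r['age'] // 10) * 10}-{(r['age'] // 10) * 10 + 9}"}
            for r in rows]

rows = [
    {"age": 31, "postcode": "10115"},
    {"age": 34, "postcode": "10115"},
    {"age": 38, "postcode": "10115"},
]
k_before = k_anonymity(rows, ("age", "postcode"))                  # exact ages: k=1
k_after = k_anonymity(generalise_age(rows), ("age", "postcode"))   # binned: k=3
```

Measuring k is trivial; the hard (NP-hard) part is choosing the *minimal* set of generalisations that reaches a target k without destroying the data's utility.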
One of the best-known dynamic approaches is differential privacy. Here, random noise is added to the results of the query to ensure that it isn't possible to differentiate one user from another (hence “differential”). The maths behind this is quite complex, and true differential privacy is hard to achieve, so companies often adopt more relaxed forms such as randomised differential privacy. Mechanisms like differential privacy provide strong anonymity, but they are hard to implement and suffer from the issue of the “privacy budget”. The problem is that when you use truly random noise to anonymise your data, every time you query the same data you reduce the level of anonymisation, because the aggregate results can be used to reconstruct the original data by filtering out the noise. In differential privacy, the value epsilon (ε) determines how strict the privacy is. The smaller the value, the better the privacy but the worse the accuracy of any results from analysing the data. Also, the smaller the value of ε, the fewer times you can access the data (effectively, ε is proportional to your privacy budget). The original authors of differential privacy assumed that ε would always be set to a value “much less than one.” However, a 2018 paper looking at the issues with deploying differential privacy at the US Census Bureau found that practitioners were picking values of ε sufficient to give the required accuracy and then tripling them, to ensure they could repeatedly sample the same dataset without the need to re-anonymise it.
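The standard way to privatise a count query is the Laplace mechanism: add Laplace noise with scale = sensitivity / ε. The sketch below uses only the standard library (sampling Laplace noise via the inverse CDF) and empirically shows the accuracy trade-off the text describes: smaller ε means more noise.

```python
import math
import random

def dp_count(true_count, epsilon, sensitivity=1.0):
    """Laplace mechanism for a counting query: one person joining or
    leaving changes the count by at most `sensitivity`, so noise with
    scale = sensitivity / epsilon gives epsilon-differential privacy."""
    scale = sensitivity / epsilon
    u = random.random() - 0.5
    # Inverse-CDF sampling of Laplace(0, scale).
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_count + noise

random.seed(0)  # deterministic demo only; never seed in production
# Average absolute error over many releases of the same true count.
err_small_eps = [abs(dp_count(100, 0.1) - 100) for _ in range(1000)]
err_large_eps = [abs(dp_count(100, 10.0) - 100) for _ in range(1000)]
```

With ε = 0.1 the typical error is around 10, while with ε = 10 it is around 0.1, which is exactly why practitioners are tempted to inflate ε, spending their privacy budget faster.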
To retain data utility, data controllers typically combine a complex variety of mechanisms that all provide some anonymity but may not protect against re-identification in all cases. These include rounding, cell swapping, outlier removal, aggregation, sampling, and others. Getting this right requires substantial expertise and again, if done completely right, this process typically destroys the utility of the data.
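Two of the mechanisms just listed, small-cell suppression and rounding, can be sketched in a few lines. The threshold and rounding base are arbitrary illustrative choices, not recommended values, and a real disclosure-control pipeline would layer several such mechanisms.

```python
def protect_aggregate(counts, base=5, threshold=10):
    """Apply two common disclosure-control steps to aggregate counts:
    suppress cells below `threshold` (too few people to publish safely)
    and round the remaining counts to a multiple of `base`."""
    out = {}
    for key, n in counts.items():
        if n < threshold:
            out[key] = None  # suppressed: cell would reveal too much
        else:
            out[key] = base * round(n / base)
    return out

published = protect_aggregate({"asthma": 3, "diabetes": 97})
```

Each step blunts a different attack: suppression protects rare individuals, rounding frustrates differencing between overlapping queries; neither alone protects against re-identification in all cases.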
The problem comes down to the fact that the same data may or may not count as personal, depending on the nature of the dataset. If you take a conservative approach that guarantees anonymity in all cases, you will very likely end up with no usable data. But if you take a more nuanced approach that preserves useful data, you won't be able to demonstrate that you have preserved anonymity. Essentially, this means that each time you run an analysis you need to re-assess whether the data released is suitably anonymised. Not only does this become extremely time consuming, but it also carries a good deal of bureaucracy, as you have to satisfy the data controller and the competent data protection authority that you are doing it correctly.
Exploding some myths about anonymisation
The problems outlined above have given rise to some myths about data anonymisation. The biggest of these is that “Anonymisation will always destroy my data”. This myth has arisen because of the tendency for people to take the conservative approach of applying anonymisation aggressively and uniformly across all the data. As already pointed out, doing anonymisation correctly is a hard problem (indeed, in some cases it may be an insoluble problem). But that does not mean that anonymisation has to destroy your data.
Another commonly held belief is that pseudonymisation is synonymous with anonymisation. This is partly because in many jurisdictions anonymisation is not clearly defined, or is defined in such a way that it encompasses pseudonymisation. In the EU, however, anonymisation and pseudonymisation are clearly distinguished: anonymised data is exempt from GDPR, but pseudonymised data is not.
The final common myth is that anonymisation is simply not possible. This arises from the fact that people know the issues that anonymisation techniques such as k-anonymity have with preventing re-identification. Consequently they assume this means it is never possible to fully anonymise data.
How to use anonymised data in analysis
One issue some analysts have is understanding how to use anonymised data in their analysis. It can be easy to fall into the trap of thinking that because the data is anonymised, it will be harder to work with. However, in practice you can often use anonymised data in exactly the same way you use raw data.
Let’s look at a simple example. Imagine you work for a bank and have been asked to generate a simple analysis. This will show the link between customer deposits, demographics and location. More specifically, the bank wants to know how many of its customers under 35 who bank in a certain branch deposit at least €2,500 a month. The data should also include gender and how much money the customers have in their account. This data will then be used for targeted marketing in that branch.
Using the raw data to generate this would be simple. Find all customers in that branch. Then find the ones that are under 35. Then remove those that don't reach the required monthly deposit level. Finally, use the data about gender and bank balance to generate the resulting plot. With anonymised data the process is similar. However, you have to be aware of how the anonymisation has been done. For instance, if the customer ages have been binned into decades (<20, 20–29, 30–39, etc.), you won't be able to identify all the customers under 35 without also including some customers aged 35–39. Equally, if it turns out that only a handful of the customers are male, then you won't be able to see the gender vs. account balance results for males.
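The bank query above can be run against the binned data, provided the code is honest about which bins only partially match the "under 35" condition. The records and field names below are invented for illustration.

```python
# Customer records after anonymisation: exact ages replaced by decade bins.
customers = [
    {"branch": "B1", "age_bin": "20-29", "monthly_deposit": 3000,
     "gender": "F", "balance": 9000},
    {"branch": "B1", "age_bin": "30-39", "monthly_deposit": 2600,
     "gender": "M", "balance": 4000},
    {"branch": "B1", "age_bin": "40-49", "monthly_deposit": 5000,
     "gender": "F", "balance": 20000},
]

def under_35_depositors(rows, branch, min_deposit=2500):
    """Split results into bins entirely below 35 ('exact') and the
    30-39 bin ('partial'), which may include customers aged 35-39."""
    eligible = [r for r in rows if r["branch"] == branch
                and r["monthly_deposit"] >= min_deposit]
    exact = [r for r in eligible if r["age_bin"] in ("<20", "20-29")]
    partial = [r for r in eligible if r["age_bin"] == "30-39"]
    return exact, partial

exact, partial = under_35_depositors(customers, "B1")
```

Reporting the two groups separately makes the uncertainty introduced by binning explicit, rather than silently over- or under-counting.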
How do I choose the best tools for my project?
So far, this paper has looked at the analytics stack in a purely abstract manner. In this section we give you some advice on how to select tools for your stack. As with most things, the right tool for one setting won’t be right in another setting. So, this advice covers the things you should consider when you select a particular tool. For simplicity, we will divide these tools up into data storage, data privacy, data analysis and visualisation.
Data storage tools
Once upon a time, companies had very few choices for how to store their data. They could use some form of structured database (often MySQL, DB2 or Oracle), they could use spreadsheets (amazingly, this is still done by many companies) or they could use some form of roll-your-own unstructured data store. The advent of the cloud, coupled with the growth in big data, has seen a plethora of new storage approaches emerge. Broadly, these can be grouped into three categories: highly structured databases, lightweight key-value stores and unstructured storage (such as NoSQL). Which approach is right for your analytics stack will depend on a number of factors, which we discuss below.
Volume of data
The first thing to consider is the volume of data, which will directly influence your choice of storage. Really huge datasets may need to be stored in some form of unstructured data lake, probably using a modern NoSQL protocol, or they may require a proprietary database such as Microsoft Azure SQL Data Warehouse. By contrast, a customer database for a small business may fit on a single server (with appropriate backup, of course).
Nature of the data
The nature of the data will directly influence the nature of the storage. Some data has very obvious natural structure that has to be preserved. A good example here is health records which will always contain certain items of data like date of birth, details of any allergies or medical conditions, lists of vaccinations, what medication has been prescribed, etc. Other data may naturally lend itself to key-value style storage. Data with little structure or relatively random structure may not be suitable for storage in an SQL database, in which case you need to choose an alternative.
Use case for data
In many cases, analysis may be a secondary use case for your data. Take the example of a bank. Here, the primary use case for customer account data is to provide banking services. This means that your data storage must be suitable for your customer-facing and in-branch systems. Considerations here may include required speed of access, API constraints and indexing or search requirements. You may even decide that it is better to have two separate versions of the data, one for analytics and the other for the main business use.
One thing you need to be cognisant of is any restrictions on data location that may need to be followed. For instance, your company policies may state that all data must be held on company premises. If this is the case, then you can’t use a cloud-based data warehouse tool such as Google’s BigQuery. It’s worth highlighting here that GDPR does not ban you from storing personal data outside the EU. However, it does require you to ensure that data is appropriately protected and that that protection is legally binding.
Data security and privacy tools
GDPR mandates that you take reasonable steps to ensure your data is stored securely by means of “appropriate technical and organisational measures”. It is deliberately vague about what these measures constitute, but it does make certain suggestions. When considering what measures are appropriate, you need to assess the risks and use a combination of organisational policies and physical or technical security measures. You must make sure you consider the current state of the art regarding security. This means being aware of the latest approaches such as two-factor authentication, TLS and encryption. However, you are also allowed to take cost into account: for a small shop, spending tens of thousands on a hardware firewall would not be reasonable.
One of the important requirements that is easily overlooked is that you must be able to restore full access to your data and systems in a “timely manner” following any serious incident. In effect this means you must have a disaster recovery plan. This is part of the requirement to ensure the “confidentiality, integrity and availability” of your data. Therefore, you should also consider whether you need to include anonymisation tools right in your analytics stack, or whether you will use other approaches to ensure data privacy is maintained.
Data analysis and visualisation tools
Data analysis is a very broad term – Wikipedia defines it as “… a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.” There are a huge number of tools that can be used to help perform data analysis tasks. These include tools for extracting and finding data, tools for modelling the data and tools to extract information from the data.
Locating the correct data is a key part of data analysis. There are a number of approaches to data mining and searching. Some analysts might need the ability to run SQL queries in order to access the data (in which case your system must be able to parse SQL). Others may want to use a natural language query tool to search the data. Yet others may use simple keyword searching. Whatever your analysts’ requirements, your analytics stack needs to offer the right support, while remaining cognisant of the requirements for data privacy.
Building models that use the underlying data to predict future events is a key task for many data analysts. These models may be simple VBA scripts in an Excel spreadsheet, or they may be more complex models created in R or SPSS. Again, your stack will need to offer suitable hooks and access to allow these models to be created and to produce usable data.
Often data needs to be visualised as an integral part of the analysis. And almost certainly, visualisations are necessary to disseminate the results of the analysis. Once, data visualisation was limited to simple things like graphs plotted in R or charts created in Excel. Over recent years there has been an explosion in the field of data visualisation, with new chart types being invented and systems that are able to display dynamic dashboards built with point-and-click interfaces.