Analytics and Privacy
Here you can find a comprehensive guide to building a modern privacy-preserving analytics stack. It is aimed mainly at data analysts and senior managers who want to better understand how to comply with the requirements imposed by GDPR while still making full use of data analysis.
According to some commentators, data scientists spend up to 80% of their time on housekeeping tasks such as locating, organising, cleaning and de-duplicating data, and only 20% on actual data analysis. While this 80/20 rule is anecdotal rather than a measured figure, such ancillary tasks can certainly end up taking a disproportionate share of a data analyst’s time. Adding stringent requirements for data privacy only increases that burden.
Analytics is a widely used term in the modern world. And like many such terms, different people mean different things when they use it. In this paper we’re going to define it in fairly broad terms as the process of taking a large, often diverse, dataset and extracting valuable insights that can be used as part of your business intelligence and planning. (Nowadays people often refer to “big data” when they are talking about analytics. While there is a close linkage, the overwhelming majority of datasets don’t really justify the big data sobriquet.) The important aspect here is that the analysis should generate insights that are valuable and useful.
This paper focuses on how to create the best privacy-preserving data analytics stack.
That means a stack that increases productivity and reduces the time lost to data housekeeping while guaranteeing data privacy. Our aim is to give data analysts a firm understanding of some important concepts in data privacy and security, and to explain how these interact with the actual analytics stack.
Does Data Protection Always Have to Be Tedious?
Data privacy has always been important, especially for companies dealing with individuals and the general public. You only have to look at significant data breaches, such as the one in which hackers stole the details of almost 150 million Equifax customers, to see the enormous damage a company can suffer. That breach is estimated to have cost the company $60-75 million in lost profits as well as impacting future sales. Since the General Data Protection Regulation (GDPR) took effect in May 2018, data privacy has assumed increased significance for any company that deals with customers who are EU citizens or residents. As a result, data privacy and security often end up taking a significant share of an analyst’s time.
Over the course of this paper, we will describe the constituent parts of an analytics stack, explain the requirements for data privacy and security, explore how anonymization can help achieve these, and then look at how to choose the right tools for your analytics project.
Overall the aim is to show you that with the correct analytics setup you can claw back some of your lost productivity and spend your time on actual data analysis rather than data housekeeping tasks.
Before explaining how to choose the best analytics tool stack, we first need to create an abstract model for the stack. This allows us to discuss the required functionality without being wedded to preconceived ideas about the capabilities and limitations of specific tools such as Postgres.
One of the main focuses of this paper is data privacy. However, if you are collecting personal data then you can’t achieve data privacy without data security. In this section we will explore the Data Security and Privacy function in detail.
GDPR is one of the most far-reaching data protection laws anywhere in the world, and as such it has had a huge impact globally. This is because, unlike many national data protection laws, GDPR applies to any company that deals with EU residents, wherever in the world that company is based. In this section we will look at the specific impact GDPR has had on analytics.
Since GDPR only relates to personal data, any data that is not personal is not covered by the regulation. This means that if you are able to completely remove any personal identifiers from the data, that data is no longer subject to the rules. This is where anonymization comes in.
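To make the idea concrete, here is a minimal Python sketch of what removing identifiers can look like in practice. It is an illustration only, not the method of any particular tool: the field names, the generalization rules and the helper functions are all hypothetical. It drops direct identifiers, coarsens two quasi-identifiers (age and postcode), and then counts how many records share each combination, a simple check in the spirit of k-anonymity.

```python
from collections import Counter

# Assumption: these field names are hypothetical examples of direct identifiers.
DIRECT_IDENTIFIERS = {"name", "email"}


def generalize(record):
    """Drop direct identifiers and coarsen the quasi-identifiers."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Coarsen the exact age to a 10-year band, e.g. 34 becomes "30-39".
    decade = (out["age"] // 10) * 10
    out["age"] = f"{decade}-{decade + 9}"
    # Keep only the first two characters of the postcode.
    out["postcode"] = out["postcode"][:2] + "***"
    return out


def smallest_group(records, keys=("age", "postcode")):
    """Return the size of the smallest group sharing the same quasi-identifiers."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return min(counts.values())


# Made-up sample records, for illustration only.
raw = [
    {"name": "Alice", "email": "alice@example.com", "age": 34, "postcode": "10115", "spend": 120},
    {"name": "Bob", "email": "bob@example.com", "age": 37, "postcode": "10117", "spend": 80},
    {"name": "Carol", "email": "carol@example.com", "age": 52, "postcode": "80331", "spend": 200},
]

cleaned = [generalize(r) for r in raw]
print(cleaned[0])
print(smallest_group(cleaned))
```

In this made-up sample the third record ends up alone in its group, so it could still be singled out even though its name and email are gone. Stripping obvious identifiers is therefore a starting point rather than proof that the data is anonymous.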
In this section we give you some advice on how to select tools for your stack. As with most things, the right tool for one setting won’t be right in another setting. So, this advice covers the things you should consider when you select a particular tool.
Aircloak Insights is the first solution to offer real-time database anonymization that allows analysts to query anonymized data exactly as if it were the original raw data. In this section we will explain how Aircloak’s technology works and show why it is the first GDPR-compliant tool for database anonymization.
For years, data protection was viewed as an annoying task that companies had to pay lip service to, but that was often overlooked or underfunded. With the advent of the EU General Data Protection Regulation, all that changed. In this part we summarize the insights from the previous chapters.