Automatic Data Synthetization



May 29, 2019 by Sebastian Probst Eide

We are excited to share that we are working on a product for generating synthetic data suitable for machine learning. It builds on our existing anonymization capabilities, and does not require human intervention for data modeling or synthesis. We are on track to start testing the solution with select customers in Q4 2019.

Not all synthetic data is created equal. Synthetic data can be created to look sensible to humans, to serve as test data, or to train machine learning models. Synthesized data meant to be viewed by humans tends to be of low utility, and it is unsatisfactory whenever it looks different from what an analyst expects. Creating data for testing purposes requires significant domain knowledge and manual effort. Synthesizing data for machine learning, however, holds a lot of potential. Correlations tend to matter more to machines than cosmetics do, which makes this kind of synthetic data well suited for automatic creation at scale.

At Aircloak we are of the opinion that data meant to be interpreted by a human should not be synthetic; it should be real. This is the sweet spot of the anonymous analytics engine we offer as part of Aircloak Insights. It allows our customers to make sense of vast amounts of highly sensitive data stored in production databases in a manner that is cognizant of, and respects, the need for privacy and anonymity.

Diffix, our core anonymization technology, has given us the ability to consistently produce anonymous data without the need for human intervention or manual effort. We are now leveraging this capability to automatically capture and distill the essence of data into a synthesized form that is suitable for machine learning purposes.

Initial tests yield very promising results. We compared the performance of classifiers from scikit-learn trained to predict the grade of loan applications. The performance (rate of correct classification) of the classifiers trained on the raw un-anonymized data and the ones trained on data from our early data synthesis prototype differed by ~3.5%. Both sets of classifiers were validated using un-anonymized test data.
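To make the evaluation setup concrete, here is a minimal sketch of the comparison described above: one classifier is trained on raw data, another on synthetic data, and both are scored on the same held-out raw test set. Everything in it is an illustrative assumption rather than our prototype. The toy two-grade dataset, the nearest-centroid classifier (a stand-in for the scikit-learn classifiers we used) and the `naive_synthesize` helper (a simple per-class Gaussian sampler, not Diffix) are all hypothetical, and the sketch uses only the Python standard library.

```python
import random
import statistics


def group_by_label(rows):
    """Group feature vectors by their label."""
    by_label = {}
    for features, label in rows:
        by_label.setdefault(label, []).append(features)
    return by_label


def fit_centroids(rows):
    """Train a nearest-centroid classifier: one mean vector per label."""
    return {
        label: [statistics.fmean(col) for col in zip(*feats)]
        for label, feats in group_by_label(rows).items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the feature vector."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))
    return min(centroids, key=sq_dist)


def accuracy(centroids, rows):
    """Rate of correct classification on labelled rows."""
    return sum(predict(centroids, f) == y for f, y in rows) / len(rows)


def naive_synthesize(rows, n, rng):
    """Hypothetical stand-in synthesizer (NOT Diffix): sample every feature
    from a Gaussian fitted per label, keeping the label proportions."""
    by_label = group_by_label(rows)
    labels = sorted(by_label)
    weights = [len(by_label[label]) for label in labels]
    stats = {
        label: [(statistics.fmean(c), statistics.pstdev(c))
                for c in zip(*by_label[label])]
        for label in labels
    }
    synthetic = []
    for _ in range(n):
        label = rng.choices(labels, weights=weights)[0]
        features = [rng.gauss(mu, sigma) for mu, sigma in stats[label]]
        synthetic.append((features, label))
    return synthetic


def make_toy_loans(n, rng):
    """Toy data (assumption): two loan grades with two numeric features."""
    rows = []
    for _ in range(n):
        grade = rng.choice(["A", "B"])
        center = 0.0 if grade == "A" else 3.0
        rows.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], grade))
    return rows


rng = random.Random(0)
train_real = make_toy_loans(1000, rng)
test_real = make_toy_loans(500, rng)  # held-out raw data for validation
train_synth = naive_synthesize(train_real, len(train_real), rng)

# Both classifiers are validated on the same raw (un-anonymized) test data.
acc_real = accuracy(fit_centroids(train_real), test_real)
acc_synth = accuracy(fit_centroids(train_synth), test_real)
print(f"trained on raw data:       {acc_real:.3f}")
print(f"trained on synthetic data: {acc_synth:.3f}")
print(f"difference:                {abs(acc_real - acc_synth):.3f}")
```

The important design choice is that the test set is never synthesized: scoring both models on raw data measures how well patterns learned from synthetic data transfer to the real thing.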

We look forward to helping pave the way for machine learning that does not come at the cost of our privacy!

