Utility Assessment of Synthetic Data Generation Methods

Md Sakib Nizam Khan, Niklas Reje and Sonja Buchegger

In this era of big data and artificial intelligence, technologies are becoming increasingly dependent on data processing and analysis. A fair share of the data used by these technologies is privacy-sensitive data about individuals, so there is a growing need for methods that facilitate privacy-preserving data analysis. Traditional privacy-enhancing approaches such as k-anonymity, l-diversity, and differential privacy protect privacy by generalizing, suppressing, or adding noise to the data, which in turn reduces its utility. Synthetic data generation is another technique that aims to protect the privacy of individuals while providing utility close to that of the original data. Its benefit in terms of privacy is that, because the synthetic records are not real, they are not directly linkable to the original data. Nonetheless, the utility of such data can be problematic. In this work, we investigate the utility provided by different synthetic data generation techniques, studying different methods and parameters for synthetic data generation. We also investigate the correlation between the general utility metrics commonly used to evaluate synthetic data and the accuracy of different analyses performed on the data. Our investigation reveals that the choice of synthetic data generation method, the number of datasets to release, and sometimes the imputation order can impact the utility of synthetic data. Since our findings show that the best-performing synthesizer and the best choice of parameters can vary, we provide a framework that facilitates the comparison of different synthetic data generation techniques for a given dataset and analysis.