Utility Assessment of Synthetic Data Generation Methods |
Md Sakib Nizam Khan, Niklas Reje and Sonja Buchegger |
In this era of big data and
artificial intelligence, technologies are becoming increasingly dependent on
data processing and analysis. A fair share of the data used by these
technologies is the privacy-sensitive data of individuals. There is a growing
need for methods that facilitate privacy-preserving
data analysis. Traditional privacy-enhancing approaches such as k-anonymity,
l-diversity, and differential privacy generalize, suppress, or add noise to
the data for privacy protection, which in turn reduces the utility of the data.
Synthetic data generation is another technique that aims to protect the
privacy of individuals while providing utility close to that of the original
data. The benefit of synthetic data generation techniques in terms of privacy
is that synthetic records are not directly linkable to the original records,
since they do not describe real individuals. Nonetheless, the utility of such
data is not guaranteed. In this work, we investigate the utility provided by different
synthetic data generation techniques. We study different methods and
parameters for synthetic data generation. We also investigate the correlation
between the general utility metrics commonly used for the evaluation of
synthetic data and the accuracy of different analyses performed on the data.
Our investigation reveals that the choice of synthetic data generation
method, the number of datasets to release, and sometimes the imputation order
can impact the utility of synthetic data. Since our
findings reveal that the best-performing synthesizer and the choice of
parameters can vary, we provide a framework that facilitates the comparison
of different synthetic data generation techniques for a given dataset and
analysis. |