“Re-identification risk is the probability that an adversary will correctly match a record in a dataset with a real person, and until now there has been no sufficiently reliable measure of this risk,” said Dr. Khaled El Emam, Senior Vice-President and General Manager of Replica Analytics, the premier science-based synthetic data generation technology provider to the healthcare industry. “Access to data and sharing de-identified datasets remain a challenge, in part due to privacy concerns. The re-identification risk estimator we have developed should help data custodians overcome those challenges.”
Most existing estimators provide only a proxy for risk and rest on strong assumptions, because real population data is rarely available and the risk therefore cannot be calculated on the population directly. Replica’s estimator uses data synthesis technology to simulate the unavailable population dataset, so that re-identification risks can be calculated much more accurately. Synthetic data generation (SDG) involves training a machine learning model to learn the statistical patterns and properties of a real dataset. The trained model, when implemented properly, is then used to create a synthetic dataset that maintains the traits of the original dataset but has no one-to-one mapping back to a person, which mitigates privacy risks.
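The general idea of measuring match risk against a simulated population can be illustrated with a minimal Python sketch. This is not Replica’s estimator or the method evaluated in the study. It assumes a common textbook-style measure in which a record’s risk is one divided by the number of population records sharing its quasi-identifier values. The function names `match_risk` and `simulate_population`, the toy records, and the naive column-wise resampling used in place of a trained synthesis model are all hypothetical.

```python
from collections import Counter
import random


def match_risk(sample, population, quasi_identifiers):
    """Average probability of a correct match for the records in `sample`.

    Assumption: each record's risk is 1 / (number of population records
    that share its quasi-identifier values).
    """
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    pop_counts = Counter(key(r) for r in population)
    risks = []
    for record in sample:
        # If the simulated population happens to miss this combination,
        # treat the record as unique (worst case) rather than divide by zero.
        group_size = max(pop_counts.get(key(record), 0), 1)
        risks.append(1.0 / group_size)
    return sum(risks) / len(risks)


def simulate_population(sample, size, seed=0):
    """Hypothetical stand-in for a synthesis model.

    Draws each column independently from its observed values in `sample`.
    A real SDG model would also learn the relationships between columns.
    """
    rng = random.Random(seed)
    columns = list(sample[0].keys())
    return [{c: rng.choice(sample)[c] for c in columns} for _ in range(size)]


# Toy data: four population records share (30, "A"); one has (40, "B").
population = [{"age": 30, "zip": "A"}] * 4 + [{"age": 40, "zip": "B"}]
sample = [{"age": 30, "zip": "A"}, {"age": 40, "zip": "B"}]

# Risks are 1/4 and 1/1, so the average is 0.625.
print(match_risk(sample, population, ["age", "zip"]))  # 0.625

# When no real population is available, a simulated one stands in for it.
simulated = simulate_population(sample, size=1000)
print(match_risk(sample, simulated, ["age", "zip"]))
```

In this sketch the quality of the estimate depends entirely on how well the simulated population resembles the real one, which is the gap a properly trained synthesis model is meant to close.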
“Measuring re-identification risk using a synthetic estimator to enable data sharing,” a study recently published in the journal PLOS ONE, includes a detailed analysis of the concepts behind Replica’s new risk estimator, an evaluation of its performance, and relevant case studies. The results show that the estimator reliably outperforms other approaches across different dataset sizes and levels of complexity, achieving a high degree of accuracy and offering a consistent estimate of the probability of re-identification. The study was also the focus of a webinar and blog post.
The new approach is another example of the usefulness and effectiveness of SDG technology in assessing and mitigating privacy risks and enabling data sharing. Replica’s estimator can now be used through the Replica Synthesis software to better assess re-identification risks in real datasets. If the risk is deemed too high, organizations can choose to synthesize the data and then use the company’s privacy assurance functionality to measure any remaining risk in the synthetic data and demonstrate that it is much lower than in the real data.