WA Health Hackathon 2023

Frequently Asked Questions about Synthetic Data

What is Synthetic Data?

Synthetic data is artificial data generated by statistical and machine learning algorithms. It is more than masked or de-identified real data. The artificial data is carefully monitored to make sure that it is statistically similar to the real data, while the individual data points are different. Because the records in synthetic data do not correspond to any real person, synthetic data is a good way to open access to health data without risking exposure of private health information.

What did the Department of Health do to create these synthetic data sets?

As part of the WA Health Data Linkage Strategy 2022 – 2024, the Department of Health used statistics-based and generative AI-based data engines to create three synthetic data sets. Extensive post-processing was applied to ensure that these synthetic data sets are representative of the real data.

The synthetic data sets that the Department of Health will be releasing for the WA Health Hackathon 2023 have passed all quality checks on how well they represent the real data sets.

How can you tell that there’s no private information in these synthetic data sets? 

There is no private information about any WA patients in these synthetic data sets. The Department of Health has encrypted any potentially personal information. For example, the field ‘person_ID’ in the synthetic data does not represent any ‘person_ID’ that you may have seen in data sets elsewhere, so the records in the synthetic data sets do not match any real patients. Additionally, to protect institution-level privacy, attributes like hospital IDs were also encrypted.

What are the strengths and drawbacks of these synthetic data sets? 

Synthetic data sets are important for health data for two reasons: 

  1. Synthetic data preserves privacy; 
  2. The synthesised data set retains the representativeness of the real data. 

 

However, to ensure the privacy of patients, we have excluded rare diseases and any occurrence that was unusually infrequent. Because these cases make up only a small share of records, the synthesised data sets will still be very valuable for data analysis and the development of new algorithms and applications.

When I’m using the synthetic data sets, is there anything I need to change in how I conduct my analysis?

No. There is nothing you need to change in how you conduct your analysis. However, please keep in mind that the data sets being released for the WA Health Hackathon 2023 do not have a linking key between them. They are separate data sets and should be analysed independently.
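
As an illustration of working without a linking key, here is a minimal Python sketch that summarises two data sets independently instead of joining them. The data set names, the columns other than ‘person_ID’, and every value shown are hypothetical and invented for this example; they are not taken from the released files.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-ins for two of the released files. The column names
# (other than person_ID) and all values are invented for illustration only.
ADMISSIONS_CSV = """person_ID,age_group
a1,0-17
a2,18-64
a3,18-64
"""

EMERGENCY_CSV = """person_ID,triage_category
x9,2
x7,3
x7,3
"""


def count_values(csv_text, column):
    """Count how often each value of `column` appears in one data set."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)


# Each data set is summarised on its own. No join on person_ID is attempted,
# because there is no linking key between the data sets.
admissions_by_age = count_values(ADMISSIONS_CSV, "age_group")
emergency_by_triage = count_values(EMERGENCY_CSV, "triage_category")

print(admissions_by_age)
print(emergency_by_triage)
```

Matching ‘person_ID’ values across data sets, if any happen to occur, should not be read as the same person, since the data sets are separate.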