|
|
## Generating Synthetic Data from OMOP-CDM Databases for Health Applications
|
|
|
|
|
|
Analysis of Electronic Health Records (EHR) has tremendous potential for enhancing patient care, quantitatively measuring the performance of clinical practices, and facilitating clinical research. Statistical estimation and machine learning (ML) models trained on EHR data can be used to predict the probability of various diseases (such as diabetes), track patient wellness, and predict how patients respond to specific drugs. For such models, researchers and practitioners need access to EHR data. However, it can be challenging to leverage EHR data while ensuring data privacy and conforming to patient confidentiality regulations. Here we present an approach for generating synthetic health data from an OMOP-CDM database. The goal of this study was to develop and evaluate a model for simulating longitudinal healthcare data that adequately captures the temporal and conditional complexities of clinical data.
|
|
|
|
|
|
![image](uploads/4454cdfc872daaa424b3d651255e867c/image.png)
|
|
|
|
|
|
Generating synthetic data comes down to learning the joint probability distribution of an original, real dataset in order to generate a new dataset with the same distribution. Deep learning models such as generative adversarial networks (GANs) and variational autoencoders (VAEs) are well suited for synthetic data generation, but they struggle to capture temporal and causal dependencies in the data and to generate the categorical variables that are common in clinical data.
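As a deliberately simplified illustration of this idea (a stand-in for, not an implementation of, the GAN/VAE models discussed here), the sketch below fits a multivariate Gaussian to a small "real" dataset and then samples a synthetic dataset with approximately the same mean and correlation structure; all attribute names and values are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" data: 1,000 patients with two correlated numeric attributes
# (e.g., age and systolic blood pressure; purely illustrative values).
real = rng.multivariate_normal(mean=[55.0, 130.0],
                               cov=[[100.0, 40.0], [40.0, 225.0]],
                               size=1000)

# "Learn" the joint distribution (here: just its first two moments).
mu = real.mean(axis=0)
sigma = np.cov(real, rowvar=False)

# Sample a synthetic dataset from the learned distribution.
synthetic = rng.multivariate_normal(mean=mu, cov=sigma, size=1000)

# The synthetic data should roughly reproduce the real correlations.
real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(round(real_corr, 2), round(synth_corr, 2))
```

Real clinical data is far from Gaussian, which is exactly why deep generative models are needed; this sketch only shows what "matching the joint distribution" means in the simplest possible setting.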
|
|
|
|
|
|
### Synthetic data generation
|
|
|
|
|
|
We propose a novel generative modeling framework that combines GANs with a bidirectional encoder representations from transformers (BERT) architecture. We first train the encoder-decoder model using a reconstruction loss. Then, we use the trained encoder to transform the original inputs into the latent space (encoder states). Lastly, we train the GAN framework using an adversarial loss in the latent space. To incorporate temporal data across multiple clinical domains, we use a hybrid approach: augmenting the input to BERT with artificial time tokens, incorporating time, age, and concept embeddings, and introducing a second learning objective for visit type.
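To make the time-token augmentation concrete, the sketch below flattens a patient's visits into a single token sequence, inserting an artificial time token between consecutive visits. The token names (`[VS]`/`[VE]` visit delimiters, `W<n>` week-interval tokens) and the helper function are assumptions for illustration, not the exact vocabulary used in our model:

```python
from datetime import date

def tokenize_visits(visits):
    """Flatten a patient's visits into a token sequence, inserting an
    artificial time token (weeks elapsed) between consecutive visits.

    `visits` is a list of (visit_date, [concept codes]) pairs sorted by
    date. Token names ([VS]/[VE], W<n>) are illustrative only.
    """
    tokens = []
    prev_date = None
    for visit_date, concepts in visits:
        if prev_date is not None:
            weeks = (visit_date - prev_date).days // 7
            tokens.append(f"W{weeks}")          # artificial time token
        tokens.append("[VS]")                   # visit start
        tokens.extend(concepts)                 # OMOP concept IDs
        tokens.append("[VE]")                   # visit end
        prev_date = visit_date
    return tokens

seq = tokenize_visits([
    (date(2020, 1, 1), ["201826"]),             # type 2 diabetes
    (date(2020, 1, 22), ["201826", "316866"]),  # + hypertensive disorder
])
print(seq)
```

The resulting sequence interleaves clinical concepts with inter-visit time gaps, letting a BERT-style model attend over both what happened and when.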
|
|
|
|
|
|
### Synthetic data evaluation
|
|
|
|
|
|
The generated synthetic data is measured along three key dimensions: fidelity, how similar the synthetic data is to the original training set; utility, how useful the synthetic data is for downstream machine learning applications; and privacy, whether any data considered sensitive in the real world has been inadvertently synthesized by our model. The first step in fidelity evaluation is to analyze whether the distribution of synthetic attributes is equivalent to the distribution of the real data. We consider the following metrics: Kullback-Leibler (KL) divergence, pairwise correlation difference, log-cluster, support coverage, and cross-classification. To evaluate utility, we use classification metrics (e.g., accuracy, F1-score, and AUC-ROC) to analyze the differences in model performance when training on real versus synthetic data. To quantify the robustness of the synthetic data with respect to privacy, we consider three different privacy attacks: membership inference, re-identification, and attribute inference.
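As one concrete fidelity check, the sketch below (a minimal illustration, not our full evaluation pipeline) computes the KL divergence between the empirical distributions of a categorical attribute in the real and synthetic data; the category counts are invented for illustration:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same categories.
    A small epsilon avoids division by zero for unseen categories."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Empirical frequencies of, say, four visit types (illustrative counts).
real_counts = [500, 300, 150, 50]
synthetic_counts = [480, 320, 140, 60]

kl = kl_divergence(real_counts, synthetic_counts)
print(round(kl, 4))  # values near 0 indicate similar distributions
```

KL divergence is asymmetric, so in practice it matters which distribution is taken as the reference; here the real data plays that role.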
|
|
|
|
|
|
![image](uploads/56b60ff07dfd0c2375c9ce092aab737b/image.png)
|
|
|
|
|
|
### References
|
|
|
|
|
|
1. Murray RE, et al. Design and validation of a data simulation model for longitudinal healthcare data. AMIA Annu Symp Proc. 2011.
|
|
|
2. Pang et al. CEHR-BERT: Incorporating temporal information from structured EHR data to improve prediction tasks. Proc. of Machine Learning for Health, 158, 2021.
|
|
|
3. Yoon et al. EHR-Safe: Generating High-Fidelity and Privacy-Preserving Synthetic Electronic Health Records. https://doi.org/10.21203/rs.3.rs-2347130/v1
|
|