Tabular Synthetic Data Generation using CTGAN
Checkout mdedit.ai, AI powered Markdown Editor for tech writers
In this post we will talk about generating synthetic data from tabular data using Generative adversarial networks(GANs). We will be using the default implementation of CTGAN [1] model.
Introduction
In the last post on GANs we saw how to generate synthetic data on Synthea dataset. Here’s a link to the post for a refresher:
https://www.maskaravivek.com/post/gan-synthetic-data-generation/
Similar to the last post, we would be working with the Synthea dataset which is publicly available.
https://synthetichealth.github.io/synthea/
In this post, we will be working on the patients.csv
file and will only be using continious and categorical fields. We will remove the other fields like name, email ID etc which contains a lot of unique values and will thus will be difficult to learn.
Data Preprocessing
Firstly, download the publicly available synthea dataset and unzip it.
Install Dependencies
In this post, we will be using the default implementation of CTGAN which is available here.
https://github.com/sdv-dev/CTGAN
To use CTGAN do a pip install. Also, we will be installing the table_evaluator
library(
link) which will help us in comparing the results with the original data.
Remove unnecessary columns and encode all data
Next, we read the data into a dataframe and drop the unnecessary columns.
Index(['MARITAL', 'RACE', 'ETHNICITY', 'GENDER', 'BIRTHPLACE', 'CITY', 'STATE',
'COUNTY', 'ZIP', 'HEALTHCARE_EXPENSES', 'HEALTHCARE_COVERAGE'],
dtype='object')
Next, we define a list with column names for categorical variables. This list will be passed to the model so that the model can decide how to process these fields.
Training the model
Next, we simply define an instance of CTGANSynthesizer
and call the fit
method with the dataframe and the list of categorical variables.
We train the model for 300 epochs only as the discriminator and generator loss becomes quite low after these many epochs.
Evaluation
Next, we simply call model’s sample
function to generate samples based on the learned model. In this example we generate 1000 samples.
MARITAL RACE ... HEALTHCARE_EXPENSES HEALTHCARE_COVERAGE
0 S asian ... 7.331230e+05 8940.917593
1 NaN white ... 1.540945e+06 3099.605568
2 NaN asian ... 1.517647e+06 11947.241606
3 NaN white ... 1.516137e+06 14091.349082
4 S native ... 1.534122e+06 5103.408672
[5 rows x 11 columns]
Now let’s try to do a feature by feature comparision between the generated data and the actual data. We will use python’s table_evaluator
library to compare the features.
We call the visual_evaluation
method to compare the actual data(data
) and the generated data(samples
).
(1171, 11) (1000, 11)
Conclusion
As its apparent from the visualizations, the similarity between the original data and the synthetic data is quite high. The results give a lot of confidence as we took a random dataset and applied the default implementation without any tweaks or any data preprocessing.
The model can be used in various scenarios where data augmentation is required. Its worthwhile to highlight a few caveats:
- In this dataset we just had categorical and continuous variables and the results were quite good.
- It would be useful to try it on datasets with date time values
- Also this model won’t be able to handle relational datasets by default. For eg. there’s no way of specifiying primary key foreign key constraints.
- Moreover, it cannot handle contraints by default. For eg. a particular state should belong to a single country but there’s no way of specifying this constraint. The generated dataset can contain new combinations of (state, country) which is not present in the original dataset.
There’s a framework to mitigate some of the above issues. Checkout SDV if you are interested. I will try to write a post about it in future.
TL;DR
Here’s the link to the Google colab notebook with the complete source code.
https://colab.research.google.com/drive/1nwbvkg32sOUC69zATCfXOygFUBeo0dsx?usp=sharing
References
[1] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. Modeling Tabular data using Conditional GAN. NeurIPS, 2019