Tabular Synthetic Data Generation using CTGAN

Last updated on Dec 18, 2020 3 min read Deep Learning, GANs

In this post, we will generate synthetic data from tabular data using generative adversarial networks (GANs). We will use the default implementation of the CTGAN [1] model.

Introduction

In the last post on GANs, we saw how to generate synthetic data from the Synthea dataset. Here’s a link to that post for a refresher:

https://www.maskaravivek.com/post/gan-synthetic-data-generation/

As in the last post, we will work with the Synthea dataset, which is publicly available.

https://synthetichealth.github.io/synthea/

In this post, we will work with the patients.csv file and use only its continuous and categorical fields. We will remove the other fields, such as names and IDs, which contain many unique values and are therefore difficult to learn.

Data Preprocessing

First, download the publicly available Synthea dataset and unzip it.

	!wget https://storage.googleapis.com/synthea-public/synthea_sample_data_csv_apr2020.zip
	!unzip synthea_sample_data_csv_apr2020.zip

Install Dependencies

We will use the default implementation of CTGAN, which is available here:

https://github.com/sdv-dev/CTGAN

To use CTGAN, install it with pip. We will also install the table_evaluator library, which will help us compare the results with the original data.

	!pip install ctgan
	!pip install table_evaluator

Remove unnecessary columns and encode all data

Next, we read the data into a dataframe and drop the unnecessary columns.

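	import pandas as pd

	# Load the patient records from the unzipped Synthea sample.
	data = pd.read_csv('csv/patients.csv')

	# Drop identifiers, names, dates, and coordinates so that only
	# categorical and continuous columns remain.
	data.drop(['Id', 'BIRTHDATE', 'DEATHDATE', 'SSN', 'DRIVERS', 'PASSPORT', 'PREFIX',
	           'FIRST', 'ADDRESS', 'LAST', 'SUFFIX', 'MAIDEN', 'LAT', 'LON'], axis=1, inplace=True)
	print(data.columns)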

Index(['MARITAL', 'RACE', 'ETHNICITY', 'GENDER', 'BIRTHPLACE', 'CITY', 'STATE',
       'COUNTY', 'ZIP', 'HEALTHCARE_EXPENSES', 'HEALTHCARE_COVERAGE'],
      dtype='object')
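
One way to see which columns are identifier-like is to compare the number of distinct values in each non-numeric column with the number of rows. The sketch below does this on a small made-up frame. The helper name unique_ratio, the toy values, and the 0.9 threshold are assumptions for illustration; they are not part of CTGAN or of the original notebook.

```python
import pandas as pd


def unique_ratio(df: pd.DataFrame) -> pd.Series:
    """Distinct values divided by row count, for non-numeric columns only."""
    non_numeric = df.select_dtypes(exclude="number")
    return (non_numeric.nunique() / len(df)).sort_values(ascending=False)


# Toy frame standing in for patients.csv (values are made up).
toy = pd.DataFrame({
    "Id": ["a1", "b2", "c3", "d4"],
    "GENDER": ["M", "F", "F", "M"],
    "RACE": ["white", "asian", "white", "white"],
    "HEALTHCARE_EXPENSES": [1200.5, 830.0, 15000.25, 42.0],
})

ratios = unique_ratio(toy)

# Assumed threshold: columns where almost every row is distinct behave
# like identifiers and give the model little to generalise from.
high_cardinality = ratios[ratios > 0.9].index.tolist()
print(high_cardinality)
```

Numeric columns are skipped on purpose, since a continuous field such as HEALTHCARE_EXPENSES is expected to have many distinct values.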

Next, we define a list with column names for categorical variables. This list will be passed to the model so that the model can decide how to process these fields.

	categorical_features = ['MARITAL', 'RACE', 'ETHNICITY', 'GENDER', 'BIRTHPLACE',
	                        'CITY', 'STATE', 'COUNTY', 'ZIP']
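
Before training, it can help to check that every name in the categorical list exists in the dataframe and to see which columns are left to be treated as continuous. The following is a minimal sketch on a made-up frame; split_columns is a hypothetical helper, not a CTGAN function.

```python
import pandas as pd


def split_columns(df: pd.DataFrame, categorical: list) -> list:
    """Validate the categorical list and return the remaining (continuous) columns."""
    unknown = [c for c in categorical if c not in df.columns]
    if unknown:
        raise ValueError(f"Unknown categorical columns: {unknown}")
    return [c for c in df.columns if c not in categorical]


# Toy frame with a shortened column set (values are made up).
toy = pd.DataFrame({
    "GENDER": ["M", "F", "F"],
    "ZIP": [2134, 2139, 2134],
    "HEALTHCARE_EXPENSES": [1200.5, 830.0, 15000.25],
})

# ZIP is stored as a number but has no meaningful ordering,
# so it is listed as categorical rather than left as continuous.
toy_categorical = ["GENDER", "ZIP"]
continuous_features = split_columns(toy, toy_categorical)
print(continuous_features)
```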

Training the model

Next, we simply define an instance of CTGANSynthesizer and call the fit method with the dataframe and the list of categorical variables.

We train the model for only 300 epochs, as the discriminator and generator losses become quite low by then.

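	from ctgan import CTGANSynthesizer

	# Written against the ctgan release current at the time of this post;
	# the class name and where epochs is set may differ in later releases.
	ctgan = CTGANSynthesizer(verbose=True)
	# Pass the categorical column names so they are modeled as discrete values.
	ctgan.fit(data, categorical_features, epochs=300)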

Evaluation

Next, we call the model’s sample method to generate samples from the learned model. In this example, we generate 1000 samples.

	samples = ctgan.sample(1000)

	print(samples.head())

  MARITAL    RACE  ... HEALTHCARE_EXPENSES HEALTHCARE_COVERAGE
0       S   asian  ...        7.331230e+05         8940.917593
1     NaN   white  ...        1.540945e+06         3099.605568
2     NaN   asian  ...        1.517647e+06        11947.241606
3     NaN   white  ...        1.516137e+06        14091.349082
4       S  native  ...        1.534122e+06         5103.408672

[5 rows x 11 columns]
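
The sample above contains NaN values in the MARITAL column. A quick way to check whether the generator reproduces missing values at a similar rate to the original data is to compare the per-column fraction of missing entries. This is a sketch on made-up frames; missing_rates is a hypothetical helper, and in the notebook it would be called as missing_rates(data, samples).

```python
import numpy as np
import pandas as pd


def missing_rates(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Fraction of missing values per column, real and synthetic side by side."""
    return pd.DataFrame({
        "real": real.isna().mean(),
        "synthetic": synthetic.isna().mean(),
    })


# Toy stand-ins for the original data and the generated samples.
real_toy = pd.DataFrame({
    "MARITAL": ["M", np.nan, "S", np.nan],
    "GENDER": ["M", "F", "F", "M"],
})
synthetic_toy = pd.DataFrame({
    "MARITAL": ["S", np.nan, np.nan, np.nan],
    "GENDER": ["F", "F", "M", "M"],
})

rates = missing_rates(real_toy, synthetic_toy)
print(rates)
```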

Now let’s do a feature-by-feature comparison between the generated data and the actual data. We will use the table_evaluator Python library to compare the features.

We call the visual_evaluation method to compare the actual data (data) and the generated data (samples).

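	from table_evaluator import TableEvaluator

	print(data.shape, samples.shape)
	# Compare real and synthetic data; cat_cols marks the categorical columns.
	table_evaluator = TableEvaluator(data, samples, cat_cols=categorical_features)

	# Plot per-feature comparisons of the two datasets.
	table_evaluator.visual_evaluation()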

(1171, 11) (1000, 11)

[Figure: comparison plots of the original and synthetic data produced by visual_evaluation()]
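
Visual comparison can be complemented with a number per column. For categorical columns, one simple option is the total variation distance between the category frequencies of the real and synthetic data: 0 means identical frequencies and 1 means no overlap. The sketch below is an illustration on made-up data and is not part of table_evaluator; the function names are hypothetical.

```python
import pandas as pd


def total_variation_distance(real: pd.Series, synthetic: pd.Series) -> float:
    """Half the L1 distance between two category frequency distributions."""
    p = real.value_counts(normalize=True)
    q = synthetic.value_counts(normalize=True)
    # Categories missing on one side count as frequency 0 there.
    return float(p.sub(q, fill_value=0).abs().sum() / 2)


def category_report(real: pd.DataFrame, synthetic: pd.DataFrame, columns: list) -> pd.Series:
    """Total variation distance for each listed column, largest first."""
    distances = {c: total_variation_distance(real[c], synthetic[c]) for c in columns}
    return pd.Series(distances).sort_values(ascending=False)


# Toy stand-ins for the original data and the generated samples.
real_toy = pd.DataFrame({
    "GENDER": ["M", "F", "F", "M"],
    "RACE": ["white", "white", "asian", "asian"],
})
synthetic_toy = pd.DataFrame({
    "GENDER": ["F", "M", "M", "F"],
    "RACE": ["white", "white", "white", "native"],
})

report = category_report(real_toy, synthetic_toy, ["GENDER", "RACE"])
print(report)
```

In the notebook, the equivalent call would be category_report(data, samples, categorical_features). Columns with the largest distance are the ones to inspect first in the plots.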

Conclusion

As is apparent from the visualizations, the similarity between the original data and the synthetic data is quite high. The results are encouraging, since we took an arbitrary dataset and applied the default implementation without any tuning and with no preprocessing beyond dropping columns.

The model can be used in various scenarios where data augmentation is required. It’s worthwhile to highlight a few caveats:

In this dataset we had only categorical and continuous variables, and the results were quite good.
It would be useful to try it on datasets with date-time values.
Also, this model won’t be able to handle relational datasets by default. For example, there’s no way of specifying primary key and foreign key constraints.
Moreover, it cannot handle constraints by default. For example, a particular state should belong to a single country, but there’s no way of specifying this constraint. The generated dataset can contain new combinations of (state, country) that are not present in the original dataset.
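
The last caveat can at least be measured after sampling. The sketch below lists value combinations that appear in the synthetic data but never in the original data, using CITY and COUNTY as the example pair. The frames are made up and novel_combinations is a hypothetical helper; this is a post-hoc check, not a way of enforcing the constraint inside the model.

```python
import pandas as pd


def novel_combinations(real: pd.DataFrame, synthetic: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Distinct combinations of `columns` found in synthetic but not in real."""
    real_combos = real[columns].drop_duplicates()
    synthetic_combos = synthetic[columns].drop_duplicates()
    # The indicator column records whether each synthetic combination has a real match.
    merged = synthetic_combos.merge(real_combos, on=columns, how="left", indicator=True)
    return merged.loc[merged["_merge"] == "left_only", columns].reset_index(drop=True)


# Toy stand-ins for the original data and the generated samples.
real_toy = pd.DataFrame({
    "CITY": ["Boston", "Cambridge", "Boston"],
    "COUNTY": ["Suffolk County", "Middlesex County", "Suffolk County"],
})
synthetic_toy = pd.DataFrame({
    "CITY": ["Boston", "Boston", "Cambridge"],
    "COUNTY": ["Suffolk County", "Middlesex County", "Middlesex County"],
})

novel = novel_combinations(real_toy, synthetic_toy, ["CITY", "COUNTY"])
print(novel)
```

Rows with such combinations could be dropped from the samples, at the cost of shifting the distribution of the remaining rows.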

There’s a framework that mitigates some of the above issues. Check out SDV if you are interested. I will try to write a post about it in the future.

TL;DR

Here’s the link to the Google Colab notebook with the complete source code.

https://colab.research.google.com/drive/1nwbvkg32sOUC69zATCfXOygFUBeo0dsx?usp=sharing

References

[1] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. Modeling Tabular data using Conditional GAN. NeurIPS, 2019
