Generating Tabular Synthetic Data Using GANs

In this post, we will see how to generate tabular synthetic data using Generative Adversarial Networks (GANs). The goal is to generate synthetic data that is similar to the actual data in terms of statistics and demographics.

Introduction

It is important to preserve privacy when publicly sharing data that contains sensitive information. There are numerous ways to tackle this, and in this post we will use neural networks to generate synthetic data whose statistical features match those of the actual data.

We will work with the publicly available Synthea dataset. Using the patients data from this dataset, we will try to generate synthetic data.

https://synthetichealth.github.io/synthea/

TL;DR

Check out the Python notebook if you are here just for the code.

https://colab.research.google.com/drive/1vBnSrTP8liPlnGg5ArzUk08oDLnERDbF?usp=sharing

Data Preprocessing

First, download the publicly available Synthea dataset and unzip it.

!wget https://storage.googleapis.com/synthea-public/synthea_sample_data_csv_apr2020.zip
!unzip synthea_sample_data_csv_apr2020.zip

Remove unnecessary columns and encode all data

Next, read the patients data and remove fields such as ID, dates, SSN, and name. Note that we are trying to generate synthetic data that can be used to train deep learning models for other tasks, and such models do not require identifier fields like ID, dates, or SSN.

import pandas as pd
df = pd.read_csv('csv/patients.csv')
df.drop(['Id', 'BIRTHDATE', 'DEATHDATE', 'SSN', 'DRIVERS', 'PASSPORT', 'PREFIX',
         'FIRST', 'ADDRESS', 'LAST', 'SUFFIX', 'MAIDEN', 'LAT', 'LON'], axis=1, inplace=True)
print(df.columns)
Index(['MARITAL', 'RACE', 'ETHNICITY', 'GENDER', 'BIRTHPLACE', 'CITY', 'STATE',
       'COUNTY', 'ZIP', 'HEALTHCARE_EXPENSES', 'HEALTHCARE_COVERAGE'],
      dtype='object')

Next, we will encode all categorical features as integer values. Simple label encoding is enough here; one-hot encoding is not required for GANs.

df["MARITAL"] = df["MARITAL"].astype('category').cat.codes
df["RACE"] = df["RACE"].astype('category').cat.codes
df["ETHNICITY"] = df["ETHNICITY"].astype('category').cat.codes
df["GENDER"] = df["GENDER"].astype('category').cat.codes
df["BIRTHPLACE"] = df["BIRTHPLACE"].astype('category').cat.codes
df["CITY"] = df["CITY"].astype('category').cat.codes
df["STATE"] = df["STATE"].astype('category').cat.codes
df["COUNTY"] = df["COUNTY"].astype('category').cat.codes
df["ZIP"] = df["ZIP"].astype('category').cat.codes
df.head()

[df.head() output: the first five rows, with each categorical column (MARITAL, RACE, ETHNICITY, GENDER, BIRTHPLACE, CITY, STATE, COUNTY, ZIP) replaced by its integer code, alongside the raw HEALTHCARE_EXPENSES and HEALTHCARE_COVERAGE values]
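The integer codes are only useful later if we also keep the code-to-label mappings around, so that generated rows can be decoded back to readable categories. Here is a minimal sketch of that idea (added for illustration, not in the original notebook; it would replace the per-column encoding above):

# Sketch: encode the categorical columns while remembering each code -> label mapping
cat_cols = ['MARITAL', 'RACE', 'ETHNICITY', 'GENDER', 'BIRTHPLACE', 'CITY', 'STATE', 'COUNTY', 'ZIP']
category_maps = {}
for col in cat_cols:
    as_cat = df[col].astype('category')
    category_maps[col] = dict(enumerate(as_cat.cat.categories))  # e.g. {0: 'M', 1: 'S'} for MARITAL
    df[col] = as_cat.cat.codes

A generated code can then be decoded with category_maps[col][code] once it has been rounded back to an integer.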

Next, we will encode the continuous features into equally sized bins. First, let's find the minimum and maximum values of HEALTHCARE_EXPENSES and HEALTHCARE_COVERAGE and then create bins based on these values.

HEALTHCARE_EXPENSES_MIN = df["HEALTHCARE_EXPENSES"].min()
HEALTHCARE_EXPENSES_MAX = df["HEALTHCARE_EXPENSES"].max()
print('Min and max healthcare expense', HEALTHCARE_EXPENSES_MIN, HEALTHCARE_EXPENSES_MAX)
HEALTHCARE_COVERAGE_MIN = df["HEALTHCARE_COVERAGE"].min()
HEALTHCARE_COVERAGE_MAX = df["HEALTHCARE_COVERAGE"].max()
print('Min and max healthcare coverage', HEALTHCARE_COVERAGE_MIN, HEALTHCARE_COVERAGE_MAX)
Min and max healthcare expense 1822.1600000000005 2145924.400000002
Min and max healthcare coverage 0.0 927873.5300000022

Now, we encode HEALTHCARE_EXPENSES and HEALTHCARE_COVERAGE into bins using the pd.cut method. We use numpy’s linspace method to create equally sized bins.

import numpy as np
df_healthcare_expenses = pd.cut(df['HEALTHCARE_EXPENSES'], bins=np.linspace(HEALTHCARE_EXPENSES_MIN, HEALTHCARE_EXPENSES_MAX, 21), labels=False)
df_healthcare_coverage = pd.cut(df['HEALTHCARE_COVERAGE'], bins=np.linspace(HEALTHCARE_COVERAGE_MIN, HEALTHCARE_COVERAGE_MAX, 21), labels=False)
df.drop(["HEALTHCARE_EXPENSES", "HEALTHCARE_COVERAGE"], axis=1, inplace=True)
df = pd.concat([df, df_healthcare_expenses, df_healthcare_coverage], axis=1)
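Since pd.cut only keeps the bin index, turning generated samples back into actual dollar amounts requires the bin edges. Here is a small sketch (added for illustration, not part of the original notebook) that maps a bin index back to the midpoint of its bin:

# Sketch: map a HEALTHCARE_EXPENSES bin index (0-19) back to an approximate dollar amount
expense_edges = np.linspace(HEALTHCARE_EXPENSES_MIN, HEALTHCARE_EXPENSES_MAX, 21)
expense_midpoints = (expense_edges[:-1] + expense_edges[1:]) / 2

def decode_expense_bin(bin_index):
    # bin_index is assumed to be an integer between 0 and 19
    return expense_midpoints[int(bin_index)]

print(decode_expense_bin(0))  # roughly the centre of the lowest expense bin

The same idea applies to HEALTHCARE_COVERAGE with its own edges.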

Transform the data

Next, we apply a PowerTransformer to all fields so that each feature follows an approximately Gaussian distribution.

from sklearn.preprocessing import PowerTransformer
df[df.columns] = PowerTransformer(method='yeo-johnson', standardize=True, copy=True).fit_transform(df[df.columns])
print(df)
       MARITAL      RACE  ...  HEALTHCARE_EXPENSES  HEALTHCARE_COVERAGE
0     0.334507  0.461541  ...            -0.819522            -0.187952
1     0.334507  0.461541  ...             0.259373            -0.187952
2     0.334507  0.461541  ...            -0.111865            -0.187952
3     0.334507  0.461541  ...             0.426979            -0.187952
4    -1.275676  0.461541  ...            -0.111865            -0.187952
...        ...       ...  ...                  ...                  ...
1166  0.334507 -2.207146  ...             1.398831            -0.187952
1167  1.773476  0.461541  ...             0.585251            -0.187952
1168  1.773476  0.461541  ...             1.275817             5.320497
1169  0.334507  0.461541  ...             1.016430            -0.187952
1170  0.334507  0.461541  ...             1.275817            -0.187952

[1171 rows x 11 columns]


/usr/local/lib/python3.6/dist-packages/sklearn/preprocessing/_data.py:2982: RuntimeWarning: divide by zero encountered in log
  loglike = -n_samples / 2 * np.log(x_trans.var())
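One caveat of fitting and transforming in a single expression is that the fitted transformer is thrown away, so there is no way to map generated samples back to the original scale later. A small variation (a sketch, keeping the rest of the pipeline unchanged) holds on to the transformer so its inverse_transform can be used after generation:

from sklearn.preprocessing import PowerTransformer

# Keep a reference to the fitted transformer for decoding generated samples later
power_transformer = PowerTransformer(method='yeo-johnson', standardize=True, copy=True)
df[df.columns] = power_transformer.fit_transform(df[df.columns])

# Later, once synthetic rows have been generated in the transformed space:
# decoded = power_transformer.inverse_transform(gen_df[df.columns].values)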

Train the Model

Next, let's define the neural network for generating synthetic data. We will use a GAN, which comprises a generator and a discriminator that try to beat each other; in the process, the generator learns to produce realistic data vectors.

The model is adapted from a GitHub repository where it is used to generate synthetic credit card fraud data.

import os
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras import Model
from tensorflow.keras.optimizers import Adam


class GAN():
    def __init__(self, gan_args):
        [self.batch_size, lr, self.noise_dim,
         self.data_dim, layers_dim] = gan_args

        self.generator = Generator(self.batch_size). \
            build_model(input_shape=(self.noise_dim,), dim=layers_dim, data_dim=self.data_dim)
        self.discriminator = Discriminator(self.batch_size). \
            build_model(input_shape=(self.data_dim,), dim=layers_dim)

        optimizer = Adam(lr, 0.5)

        # Build and compile the discriminator
        self.discriminator.compile(loss='binary_crossentropy',
                                   optimizer=optimizer,
                                   metrics=['accuracy'])

        # The generator takes noise as input and generates records
        z = Input(shape=(self.noise_dim,))
        record = self.generator(z)

        # For the combined model we will only train the generator
        self.discriminator.trainable = False

        # The discriminator takes generated records as input and determines validity
        validity = self.discriminator(record)

        # The combined model (stacked generator and discriminator)
        # trains the generator to fool the discriminator
        self.combined = Model(z, validity)
        self.combined.compile(loss='binary_crossentropy', optimizer=optimizer)

    def get_data_batch(self, train, batch_size, seed=0):
        # Alternative: purely random sampling (easy to implement, but some samples
        # would be sampled much more or less often than others):
        # np.random.seed(seed)
        # x = train.loc[np.random.choice(train.index, batch_size)].values

        # Instead, iterate through shuffled indices so every sample gets covered evenly
        start_i = (batch_size * seed) % len(train)
        stop_i = start_i + batch_size
        shuffle_seed = (batch_size * seed) // len(train)
        np.random.seed(shuffle_seed)
        train_ix = np.random.choice(list(train.index), replace=False, size=len(train))  # wasteful to shuffle every time
        train_ix = list(train_ix) + list(train_ix)  # duplicate to cover ranges past the end of the set
        x = train.loc[train_ix[start_i: stop_i]].values
        return np.reshape(x, (batch_size, -1))

    def train(self, data, train_arguments):
        [cache_prefix, epochs, sample_interval] = train_arguments
        data_cols = data.columns

        # Adversarial ground truths
        valid = np.ones((self.batch_size, 1))
        fake = np.zeros((self.batch_size, 1))

        for epoch in range(epochs):
            # ---------------------
            #  Train Discriminator
            # ---------------------
            batch_data = self.get_data_batch(data, self.batch_size)
            noise = tf.random.normal((self.batch_size, self.noise_dim))

            # Generate a batch of new records
            gen_data = self.generator.predict(noise)

            # Train the discriminator
            d_loss_real = self.discriminator.train_on_batch(batch_data, valid)
            d_loss_fake = self.discriminator.train_on_batch(gen_data, fake)
            d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)

            # ---------------------
            #  Train Generator
            # ---------------------
            noise = tf.random.normal((self.batch_size, self.noise_dim))

            # Train the generator (to have the discriminator label samples as valid)
            g_loss = self.combined.train_on_batch(noise, valid)

            # Print the progress
            print("%d [D loss: %f, acc.: %.2f%%] [G loss: %f]" % (epoch, d_loss[0], 100 * d_loss[1], g_loss))

            # If at sample interval => save model checkpoints and generate sample data
            if epoch % sample_interval == 0:
                model_checkpoint_base_name = 'model/' + cache_prefix + '_{}_model_weights_step_{}.h5'
                self.generator.save_weights(model_checkpoint_base_name.format('generator', epoch))
                self.discriminator.save_weights(model_checkpoint_base_name.format('discriminator', epoch))

                # Generate a batch of synthetic data from the current generator
                z = tf.random.normal((432, self.noise_dim))
                gen_data = self.generator(z)
                print('generated_data')

    def save(self, path, name):
        assert os.path.isdir(path) == True, \
            "Please provide a valid path. Path must be a directory."
        model_path = os.path.join(path, name)
        self.generator.save_weights(model_path)  # Save the generator weights
        return

    def load(self, path):
        assert os.path.isdir(path) == True, \
            "Please provide a valid path. Path must be a directory."
        self.generator = Generator(self.batch_size)
        self.generator = self.generator.load_weights(path)
        return self.generator


class Generator():
    def __init__(self, batch_size):
        self.batch_size = batch_size

    def build_model(self, input_shape, dim, data_dim):
        input = Input(shape=input_shape, batch_size=self.batch_size)
        x = Dense(dim, activation='relu')(input)
        x = Dense(dim * 2, activation='relu')(x)
        x = Dense(dim * 4, activation='relu')(x)
        x = Dense(data_dim)(x)
        return Model(inputs=input, outputs=x)


class Discriminator():
    def __init__(self, batch_size):
        self.batch_size = batch_size

    def build_model(self, input_shape, dim):
        input = Input(shape=input_shape, batch_size=self.batch_size)
        x = Dense(dim * 4, activation='relu')(input)
        x = Dropout(0.1)(x)
        x = Dense(dim * 2, activation='relu')(x)
        x = Dropout(0.1)(x)
        x = Dense(dim, activation='relu')(x)
        x = Dense(1, activation='sigmoid')(x)
        return Model(inputs=input, outputs=x)

Next, let's define the training parameters for the GAN. We will use a batch size of 32 and train it for 5,000 epochs.

data_cols = df.columns
#Define the GAN and training parameters
noise_dim = 32
dim = 128
batch_size = 32
log_step = 100
epochs = 5000+1
learning_rate = 5e-4
models_dir = 'model'
!mkdir model
df[data_cols] = df[data_cols]
print(df.shape[1])
gan_args = [batch_size, learning_rate, noise_dim, df.shape[1], dim]
train_args = ['', epochs, log_step]
11
mkdir: cannot create directory ‘model’: File exists

Finally, let’s run the training and see if the model is able to learn something.

model = GAN
#Training the GAN model chosen: Vanilla GAN, CGAN, DCGAN, etc.
synthesizer = model(gan_args)
synthesizer.train(df, train_args)
generated_data
101 [D loss: 0.324169, acc.: 85.94%] [G loss: 2.549267]
.
.
.
4993 [D loss: 0.150710, acc.: 95.31%] [G loss: 2.865143]
4994 [D loss: 0.159454, acc.: 95.31%] [G loss: 2.886763]
4995 [D loss: 0.159046, acc.: 95.31%] [G loss: 2.640226]
4996 [D loss: 0.150796, acc.: 95.31%] [G loss: 2.868319]
4997 [D loss: 0.170520, acc.: 95.31%] [G loss: 2.697939]
4998 [D loss: 0.161605, acc.: 95.31%] [G loss: 2.601780]
4999 [D loss: 0.156147, acc.: 95.31%] [G loss: 2.719781]
5000 [D loss: 0.164568, acc.: 95.31%] [G loss: 2.826339]
WARNING:tensorflow:Model was constructed with shape (32, 32) for input Tensor("input_1:0", shape=(32, 32), dtype=float32), but it was called on an input with incompatible shape (432, 32).
generated_data

After 5,000 epochs, the model shows a discriminator accuracy of 95.31%, which sounds quite impressive.

!mkdir model/gan
!mkdir model/gan/saved
mkdir: cannot create directory ‘model/gan’: File exists
mkdir: cannot create directory ‘model/gan/saved’: File exists
#You can easily save the trained generator and load it afterwards
synthesizer.save('model/gan/saved', 'generator_patients')
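To reuse the generator later without retraining, the saved weights can be loaded back into a freshly built generator. A minimal sketch (it reuses the Generator class and notebook variables defined above, and assumes the path from the save call in the previous cell):

# Sketch: rebuild the generator architecture and load the saved weights into it
loaded_generator = Generator(batch_size).build_model(
    input_shape=(noise_dim,), dim=dim, data_dim=df.shape[1])
loaded_generator.load_weights('model/gan/saved/generator_patients')

# Sample a batch of synthetic rows (still in the transformed/encoded space)
noise = tf.random.normal((batch_size, noise_dim))
synthetic_batch = pd.DataFrame(loaded_generator.predict(noise), columns=df.columns)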

Let’s take a look at the Generator and Discriminator models.

synthesizer.generator.summary()
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         [(32, 32)]                0         
_________________________________________________________________
dense (Dense)                (32, 128)                 4224      
_________________________________________________________________
dense_1 (Dense)              (32, 256)                 33024     
_________________________________________________________________
dense_2 (Dense)              (32, 512)                 131584    
_________________________________________________________________
dense_3 (Dense)              (32, 11)                  5643      
=================================================================
Total params: 174,475
Trainable params: 174,475
Non-trainable params: 0
_________________________________________________________________
synthesizer.discriminator.summary()
Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_2 (InputLayer)         [(32, 11)]                0         
_________________________________________________________________
dense_4 (Dense)              (32, 512)                 6144      
_________________________________________________________________
dropout (Dropout)            (32, 512)                 0         
_________________________________________________________________
dense_5 (Dense)              (32, 256)                 131328    
_________________________________________________________________
dropout_1 (Dropout)          (32, 256)                 0         
_________________________________________________________________
dense_6 (Dense)              (32, 128)                 32896     
_________________________________________________________________
dense_7 (Dense)              (32, 1)                   129       
=================================================================
Total params: 170,497
Trainable params: 0
Non-trainable params: 170,497
_________________________________________________________________

Evaluation

Now that we have trained the model, let's see if the generated data is similar to the actual data.

We plot the generated data at several training steps and observe how the plot changes as the network learns the data distribution more accurately.

models = {'GAN': ['GAN', False, synthesizer.generator]}

import matplotlib.pyplot as plt

# Set up visualization parameters
seed = 17
test_size = 492  # number of synthetic samples to visualize
noise_dim = 32

np.random.seed(seed)
z = np.random.normal(size=(test_size, noise_dim))
real = synthesizer.get_data_batch(train=df, batch_size=test_size, seed=seed)
real_samples = pd.DataFrame(real, columns=data_cols)

model_names = ['GAN']
colors = ['deepskyblue', 'blue']
markers = ['o', '^']
col1, col2 = 'CITY', 'ETHNICITY'
base_dir = 'model/'

# Compare actual patients data against data generated at each saved training step
model_steps = [0, 100, 200, 300, 400, 500, 1000, 2000, 3000, 4000, 5000]
rows = len(model_steps)
columns = 5

axarr = [[]] * len(model_steps)
fig = plt.figure(figsize=(14, rows * 3))

for model_step_ix, model_step in enumerate(model_steps):
    # Left column: the actual patients data
    axarr[model_step_ix] = plt.subplot(rows, columns, model_step_ix * columns + 1)

    for group, color, marker in zip(real_samples.groupby('RACE'), colors, markers):
        plt.scatter(group[1][[col1]], group[1][[col2]],
                    marker=marker, edgecolors=color, facecolors='none')

    plt.title('Actual Patients Data')
    plt.ylabel(col2)  # Only add y label to left plot
    plt.xlabel(col1)
    xlims, ylims = axarr[model_step_ix].get_xlim(), axarr[model_step_ix].get_ylim()

    if model_step_ix == 0:
        legend = plt.legend()
        legend.get_frame().set_facecolor('white')

    # Next column: data generated from the checkpoint saved at this training step
    i = 0
    [model_name, with_class, generator_model] = models['GAN']
    generator_model.load_weights(base_dir + '_generator_model_weights_step_' + str(model_step) + '.h5')

    ax = plt.subplot(rows, columns, model_step_ix * columns + 1 + (i + 1))
    g_z = generator_model.predict(z)
    gen_samples = pd.DataFrame(g_z, columns=data_cols)
    gen_samples.to_csv('Generated_sample.csv')
    plt.scatter(gen_samples[[col1]], gen_samples[[col2]],
                marker=markers[0], edgecolors=colors[0], facecolors='none')
    plt.title("Generated Data")
    plt.xlabel(col1)
    ax.set_xlim(xlims), ax.set_ylim(ylims)

plt.suptitle('Comparison of GAN outputs', size=16, fontweight='bold')
plt.tight_layout(rect=[0.075, 0, 1, 0.95])

# Add text labels for the training steps
vpositions = np.array([i._position.bounds[1] for i in axarr])
vpositions += ((vpositions[0] - vpositions[1]) * 0.35)
for model_step_ix, model_step in enumerate(model_steps):
    fig.text(0.05, vpositions[model_step_ix], 'training\nstep\n' + str(model_step),
             ha='center', va='center', size=12)

plt.savefig('Comparison_of_GAN_outputs.png')
No handles with labels found to put in legend.

[Figure: Comparison of GAN outputs, actual patients data vs. generated data at increasing training steps]

Now let's do a feature-by-feature comparison between the generated data and the actual data. We will use Python's table_evaluator library to compare the features.

!pip install table_evaluator

# Load the generated samples saved during the plotting step
gen_df = pd.read_csv('Generated_sample.csv')
gen_df.drop('Unnamed: 0', axis=1, inplace=True)
print(gen_df.columns)
print(df.shape, gen_df.shape)
Index(['MARITAL', 'RACE', 'ETHNICITY', 'GENDER', 'BIRTHPLACE', 'CITY', 'STATE',
       'COUNTY', 'ZIP', 'HEALTHCARE_EXPENSES', 'HEALTHCARE_COVERAGE'],
      dtype='object')
(1171, 11) (492, 11)

We call the visual_evaluation method to compare the actual data (df) and the generated data (gen_df).

from table_evaluator import load_data, TableEvaluator
cat_cols = ['MARITAL', 'RACE', 'ETHNICITY', 'GENDER', 'BIRTHPLACE', 'CITY', 'STATE', 'COUNTY', 'ZIP']
print(len(df), len(gen_df))
table_evaluator = TableEvaluator(df, gen_df)
table_evaluator.visual_evaluation()
1171 492

[Figures: table_evaluator visual evaluation plots comparing the actual and generated data]

/usr/local/lib/python3.6/dist-packages/seaborn/distributions.py:283: UserWarning: Data must have variance to compute a kernel density estimate.
  warnings.warn(msg, UserWarning)
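Besides the visual checks, a quick numeric sanity check (a sketch added for illustration; note that both frames are still in the power-transformed space at this point) is to compare per-column statistics of the real and generated data:

# Sketch: compare per-column mean and standard deviation of real vs. generated data
comparison = pd.DataFrame({
    'real_mean': df.mean(),
    'gen_mean': gen_df.mean(),
    'real_std': df.std(),
    'gen_std': gen_df.std(),
})
print(comparison.round(3))

Columns where the real and generated statistics diverge noticeably are the ones the model has not learned well yet.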

Conclusion

Some features in the synthetic data match the actual data closely, but other features were not learned perfectly by the model. We can keep experimenting with the model and its hyperparameters to improve it further.

This post demonstrates that it is fairly simple to use GANs to generate synthetic data when the actual data is sensitive and can't be shared publicly.

Vivek Maskara
SDE @ Remitly | Graduated from MS CS @ ASU | Ex-Morgan, Amazon, Zeta