quant

Reputation: 4482

Autoencoders for figuring out the underlying data distribution in Python

I have the following randomly generated data

import numpy as np
from keras import models,layers
from keras import applications
from sklearn.model_selection import train_test_split

data = np.random.normal(100, 10, 100) # generate 100 numbers

Which I split into train and test

data_train, data_test = train_test_split(data, test_size=0.33) # split into train and test

I want to train an autoencoder model on these data, in order to figure out their underlying distribution.

So, with the help of this post I am building my model

embedding_dim = 42  # dimensionality of the latent space

#Input layer
input_data = layers.Input(shape=(1,))  

#Encoding layer
encoded = layers.Dense(embedding_dim, activation='relu')(input_data)

#Decoding layer
decoded = layers.Dense(1,activation='linear')(encoded) 

#Autoencoder --> in this Model API we define the input tensor and the output layer;
#it wraps the two layers of the encoder and decoder
autoencoder = models.Model(input_data,decoded)
autoencoder.summary()

#Encoder
encoder = models.Model(input_data,encoded)

#Decoder
encoded_input = layers.Input(shape=(embedding_dim,))
decoder_layers = autoencoder.layers[-1]  # reuse the last (decoding) layer
decoder = models.Model(encoded_input, decoder_layers(encoded_input))

autoencoder.compile(
    optimizer='adadelta',  # gradient descent via backpropagation
    loss='binary_crossentropy'
)

history = autoencoder.fit(data_train,data_train,
                          epochs=50,batch_size=256,shuffle=True,
                validation_data=(data_test,data_test))

and in the end I make the predictions

# do predictions
predictions = encoder.predict(data_test) 
predictions = decoder.predict(predictions)  
predictions

Remember, the task is to figure out their underlying distribution and then create more data from it. I have a couple of questions with this (naive) approach: how should embedding_dim be chosen, and how do I generate new samples from the trained model?

Upvotes: 3

Views: 1175

Answers (2)

Daraan

Reputation: 3780

Variational Autoencoder

Initial thoughts and problems

Suppose the model has learned the distribution; then we can draw samples from the latent space L, with dim(L) = embedding_dim.

Every point in L will result in a prediction, and here we meet our first problems:

  • a) The latent space is infinitely large
  • b) and there are multiple dimensions

That means there is also an infinitely large number of samples we could draw, and it is more than unlikely that they will all result in something usable.
But there will be regions that do yield good results; these are the ones we get from the encoder. We need a way to somehow simulate an encoder output, or to know where it lies.

NOTE: The following sections are more important for categorical features; with a single distribution we should get rather continuous results in one region, not multiple clusters.


Narrowing down the Sample Space

Normalization and Activation Function

With activation functions and a BatchNormalization layer we can shrink the values down to a reasonable range. Activation functions also add non-linearity to our model, which we need in order to model non-linear functions.
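For intuition, here is a minimal sketch (my own illustration, not part of the original code) of how a bounded activation such as tanh confines every latent coordinate to [-1, 1]:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy encoder head: BatchNormalization keeps the pre-activations in a sane
# range, and tanh guarantees the latent values end up in [-1, 1].
inp = keras.Input(shape=(1,))
h = layers.BatchNormalization()(inp)
z = layers.Dense(2, activation="tanh")(h)  # latent space is now [-1, 1]^2
toy_encoder = keras.Model(inp, z)

out = toy_encoder.predict(np.random.normal(100, 10, (5, 1)))
print(out.min(), out.max())  # both within [-1, 1]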

Regularization

If we want to generate outputs, we need to avoid "blank" space in the latent space that decodes into rubbish. With regularization we can bring the useful areas closer together and enlarge them. This is again a trade-off with quality, but regularization also helps against overfitting and decreases the weights, which shrinks down the space of possible values again. Regularization is one of the most important ingredients for producing a latent space that can be used as a sample space.

(Source of the image and also a good article about VAEs and the latent space: Understanding Variational Autoencoders, https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73)
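As a small illustration (my own sketch; the full model below uses the same idea via activity_regularizer="l2"), an L2 activity regularizer adds a penalty on the layer's outputs to the loss, pulling the latent codes toward the origin:

from tensorflow import keras
from tensorflow.keras import layers

# The activity regularizer penalizes large *outputs* of the layer,
# discouraging latent codes that lie far away from 0.
inp = keras.Input(shape=(1,))
h = layers.Dense(50, activation="relu")(inp)
z = layers.Dense(2, activity_regularizer=keras.regularizers.l2(1e-3))(h)
toy = keras.Model(inp, z)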


Choosing the latent space dimension

Let's say we brought the values down to an extreme range of [-1, 1]. The sample space will still have the size [-1, 1]^embedding_dim, which can be quite large depending on its dimension!

Here we need some trade-off:

  • A higher-dimensional space has more capacity to yield good results given a good sample, but it lowers the chance of finding a good sample.
  • A lower-dimensional space increases the chance of finding a good sample, but its quality might be lower.

In short, the latent dimension of a variational autoencoder should be as low as possible; how low depends on the setting.


In theory we can think of the latent space as holding the latent variables of the input/model, from which the input can then be reconstructed.
For a normal distribution we would think of 2 variables, right? Mean and variance. So choose embedding_dim=2?
Rather NO, embedding_dim=1 should be enough.

The latent space can be smaller than the amount of latent variables:

The decoder has the potential to generalize part of the output into the bias terms of its layers, so the dimension of the latent space can be smaller than the true number of latent variables, BUT the generated outputs could lack variation.
In the case of a normal distribution, or others where the mean is constant, we can expect the decoder to learn the mean.

I did some research in that direction as well.


Create a model:

The VAE I created here is based on these two tutorials: the Keras VAE example (https://keras.io/examples/generative/vae/) and the TensorFlow convolutional VAE tutorial (https://www.tensorflow.org/tutorials/generative/cvae).

Most important changes:

  • Outputs have no activation function, as the data distribution was taken as is.

  • As there is no preprocessing like normalization, the network needs to be deeper. With even more layers and tweaks we could normalize the output of the encoder, but nicer input data has a much stronger effect.

  • Therefore the crossentropy loss was exchanged for mean squared error, to handle arbitrarily large outputs.

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt

# Your distribution

latent_dim = 1
data = np.random.normal(100, 10, 100) # generate 100 numbers
data_train, data_test = data[:-33], data[-33:]

# Note I took the distribution raw, some preprocessing should help!
# Like normalizing it and later apply on the output
# to get the real distribution back
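# A sketch of such preprocessing (my assumption, not used below):
#   mu, sigma = data_train.mean(), data_train.std()
#   data_train_scaled = (data_train - mu) / sigma
#   ... train on data_train_scaled, then rescale generated samples:
#   generated = generated_scaled * sigma + mu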

class Sampling(layers.Layer):
    """Uses (z_mean, z_log_var) to sample z, the vector encoding a digit."""

    def call(self, inputs):
        z_mean, z_log_var = inputs
        batch = tf.shape(z_mean)[0]
        dim = tf.shape(z_mean)[1]
        epsilon = tf.keras.backend.random_normal(shape=(batch, dim))
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon
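# Note (my own illustration): this layer implements the reparameterization
# trick. Calling it twice with the same (z_mean, z_log_var) yields different
# z values because epsilon is redrawn each time:
#   s = Sampling()
#   s([tf.zeros((1, 1)), tf.zeros((1, 1))])  # one draw from N(0, 1)
#   s([tf.zeros((1, 1)), tf.zeros((1, 1))])  # a different draw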


# =============================================================================
# Encoder
# There are many valid configurations of hyperparameters,
# Here it is also doable without Dropout, regularization and BatchNorm
# =============================================================================

encoder_inputs = keras.Input(shape=(1,))
x = layers.BatchNormalization()(encoder_inputs)
x = layers.Dense(200, activation="relu", activity_regularizer="l2")(x)
x = tf.keras.layers.Dropout(0.1)(x)
x = layers.Dense(200, activation="relu", activity_regularizer="l2")(x)
x = layers.BatchNormalization()(x)
x = layers.Dense(50, activation="relu", activity_regularizer="l2")(x)

# Splitting into mean and variance
z_mean = layers.Dense(latent_dim, name="z_mean", activity_regularizer="l2")(x)
z_mean = layers.BatchNormalization()(z_mean)

z_log_var = layers.Dense(latent_dim, activation="relu",  name="z_log_var")(x)
z_log_var = layers.BatchNormalization()(z_log_var)


# Create the sampling layer
z = Sampling()([z_mean, z_log_var])
encoder = keras.Model(encoder_inputs, [z_mean, z_log_var, z], name="encoder")

# =============================================================================
# Decoder
# Contrary to other Architectures we don't aim for a categorical output 
# in a range of 0...Y so linear activation in the end
# NOTE: Normalizing the training data allows the use of other functions 
# but I did not test that.
# =============================================================================

latent_inputs = keras.Input(shape=(latent_dim,))
x = layers.Dense(50, activation="relu")(latent_inputs)
x = layers.Dense(200, activation="relu")(x)
x = layers.Dense(200, activation="relu")(x)
x = layers.Dense(200, activation="linear")(x)
x = layers.Dense(1, activation="linear")(x)

decoder = keras.Model(latent_inputs, x, name="decoder")

# =============================================================================
# Create a model class
# =============================================================================

class VAE(keras.Model):
    def __init__(self, encoder, decoder, **kwargs):
        super(VAE, self).__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder
        self.total_loss_tracker = keras.metrics.Mean(name="total_loss")
        self.reconstruction_loss_tracker = keras.metrics.Mean(
            name="reconstruction_loss"
        )
        self.kl_loss_tracker = keras.metrics.Mean(name="kl_loss")

    @property
    def metrics(self):
        return [
            self.total_loss_tracker,
            self.reconstruction_loss_tracker,
            self.kl_loss_tracker,
        ]

    @tf.function
    def sample(self, amount=None, eps=None):
      if eps is None:
        eps = tf.random.normal(shape=(amount or 50, latent_dim))
      return self.decode(eps, apply_sigmoid=False)
  
    def encode(self, x):
        mean, logvar, z = self.encoder(x)
        return mean, logvar, z
  
    def reparameterize(self, mean, logvar):
      eps = tf.random.normal(shape=mean.shape)
      return eps * tf.exp(logvar * .5) + mean
  
    def decode(self, z, apply_sigmoid=False):
      logits = self.decoder(z)
      if apply_sigmoid:
        probs = tf.sigmoid(logits)
        return probs
      return logits

    def train_step(self, data):
        with tf.GradientTape() as tape:
            z_mean, z_log_var, z = self.encode(data)
            #z = self.reparameterize(z_mean, z_log_var)
            reconstruction = self.decoder(z)
            reconstruction_loss = tf.reduce_sum(keras.losses.mean_squared_error(data, reconstruction))
            kl_loss = -0.5 * (1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
            kl_loss = tf.reduce_sum(kl_loss, axis=1)
            total_loss = reconstruction_loss + kl_loss
        grads = tape.gradient(total_loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        self.total_loss_tracker.update_state(total_loss)
        self.reconstruction_loss_tracker.update_state(reconstruction_loss)
        self.kl_loss_tracker.update_state(kl_loss)
        return {
            "loss": self.total_loss_tracker.result(),
            "reconstruction_loss": self.reconstruction_loss_tracker.result(),
            "kl_loss": self.kl_loss_tracker.result(),
        }

# =============================================================================
# Training
# EarlyStopping is strongly recommended here
# but sometimes gets stuck early
# Increase the batch size if there are more samples available!
# =============================================================================

vae = VAE(encoder, decoder)
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', 
                                            patience=10, 
                                            restore_best_weights=False)

vae.compile(optimizer=keras.optimizers.Adam())
vae.fit(data_train, epochs=100, batch_size=11, callbacks=[callback])

"""
Last Epoch 33/100
7/7 [===] - 2ms/step 
- loss: 2394.6672 
- reconstruction_loss: 1130.7889 
- kl_loss: 1224.3684
"""

Evaluation (Time for plots!)

encoded_train = encoder.predict(data_train)  # returns [z_mean, z_log_var, z]
plt.hist(data_train, alpha=0.5, label="Train")
plt.hist(decoder.predict(encoded_train[2]).flatten(), alpha=0.75, label="Output")  # decode the sampled z
plt.legend()
plt.show()

encoded = encoder.predict(data_test)  # [z_mean, z_log_var, z]
plt.hist(data_test, alpha=0.5, label="Test")
plt.hist(decoder.predict(encoded[2]).flatten(), label="Output", alpha=0.5)
plt.legend()
plt.show()

Training Data and Autoencoder output

Training distribution
Everything is shifted a bit to the left. The mean was not learned ideally, but nearly perfectly.

Test Data and Autoencoder output

Test distribution
Nearly perfect as well.

Sampling Data

As mentioned above, now comes the tricky part: how to sample from our latent space.
Ideally the latent space would be centered around 0 and we could sample from a standard normal distribution. But as we still have our training data, we can check its encoding:

>>> encoded_train[0].mean()
-43.1251

>>> encoded_train[0].std()
4.4563518

These numbers could be arbitrary, but it's nice to see that the std is rather low.

Let's plug these in and compare 15000 real vs. 15000 generated samples:

sample = vae.sample(eps=tf.random.normal((15000, latent_dim), 
                                         encoded_train[0].mean(axis=0), 
                                         encoded_train[0].std(axis=0))).numpy()

plt.hist(np.random.normal(100, 10, 15000), alpha=0.5, label="Real Distribution", bins=20)
plt.hist(sample, 
         alpha=0.5, label="Sampled", bins=20)
plt.legend()
plt.show()

Real vs. sampled distribution

Looks very good, doesn't it?

>>> sample.std()
10.09742

>>> sample.mean()
97.27115

Very close to the original distribution.


Increasing the dimension of the latent space

Note: these results are somewhat empirical and, due to randomness and early stopping, not always consistent, BUT increasing the latent space will gradually make it harder to generate good samples.
As you can see, the mean still works well, but we lack variance; we need to upscale it and need a better estimate for it.

I'm a bit surprised that upscaling the variance really works. Compared to, for example, the MNIST digits, where multiple clusters exist inside the latent space that generate particularly good outputs, here there exists only one, and with the estimate from the training data we even know where it is.
Adding some prior to the mean and variance should further improve the results (at the cost of bias).

Sampling with an increased latent dimension

Upvotes: 4

fpajot

Reputation: 788

I understand you're trying to use an auto-encoder to learn a data distribution which will then allow you to create new samples out of this distribution.

First question: An auto-encoder projects your features to a latent space while learning non-linear relationships between these features.

In your case, your random samples don't have any underlying n-dimensional structure, so projecting your data points to a space of size embedding_dim won't give you good results. It would be the same for a PCA. The decoder part won't be able to recreate the data without a big loss.

I would recommend performing your test on more meaningful data in order to evaluate such a model. Then, choosing embedding_dim is a matter of capturing the non-linear interactions in your input dimensions, with the risk of overfitting if embedding_dim is too high.

Second question: Once you have trained your AE, one solution is to feed the decoder random input values between 0 and 1; it will then give you new samples from the learned distribution.
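As a minimal sketch of this idea (assuming the trained decoder and embedding_dim from the question), it could look like:

import numpy as np

# Hypothetical usage: feed uniform random latent vectors to the trained
# `decoder` from the question and decode them into new samples.
random_latent = np.random.uniform(0, 1, size=(10, embedding_dim))
new_samples = decoder.predict(random_latent)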

However, you have no guarantee that these generated samples are representative of the original data, because you would need to sample from the right part of the distribution. This would require an approach for carefully selecting the values you feed into your decoder.

Note: I would add that you should take a look at Variational Autoencoders, which have better properties for capturing distributions.


Upvotes: 1
