Reputation: 4482
I have the following randomly generated data
import numpy as np
from keras import models,layers
from keras import applications
from sklearn.model_selection import train_test_split
data = np.random.normal(100, 10, 100) # generate 100 numbers
Which I split into train and test
data_train, data_test = train_test_split(data, test_size=0.33) # split into train and test
I want to train an autoencoder model on these data, in order to figure out their underlying distribution.
So, with the help of this post I am building my model
embedding_dim = 42 # dimensionality of the latent space
#Input layer
input_data = layers.Input(shape=(1,))
#Encoding layer
encoded = layers.Dense(embedding_dim, activation='relu')(input_data)
#Decoding layer
decoded = layers.Dense(1,activation='linear')(encoded)
#Autoencoder --> in this API Model, we define the Input tensor and the output layer
#wraps the 2 layers of Encoder and Decoder
autoencoder = models.Model(input_data,decoded)
autoencoder.summary()
#Encoder
encoder = models.Model(input_data,encoded)
#Decoder
encoded_input = layers.Input(shape=(embedding_dim,))
decoder_layers = autoencoder.layers[-1] #applying the last layer
decoder = models.Model(encoded_input,decoder_layers(encoded_input))
autoencoder.compile(
optimizer='adadelta', #backpropagation Gradient Descent
loss='binary_crossentropy'
)
history = autoencoder.fit(data_train,data_train,
epochs=50,batch_size=256,shuffle=True,
validation_data=(data_test,data_test))
and in the end I am doing the predictions
# do predictions
predictions = encoder.predict(data_test)
predictions = decoder.predict(predictions)
predictions
Remember, the task is to figure out their underlying distribution and then create more data out of it. I have a couple of questions about this (naive) approach:
The dimensionality of the latent space is larger than that of the input (embedding_dim = 42 in this case), although the input data are of shape 1. How does this work then? I had the feeling that the autoencoder "shrinks" the original dimension first and then recreates the data from the shrunken representation, and that this is why the output data are "de-noised".
My predictions are made on the test set, which has 33 points, so I generate 33 predictions. My question is: since the autoencoder has "figured out" the underlying distribution of the data, is there a way to generate more than 33 predictions?
Upvotes: 3
Views: 1175
Reputation: 3780
Supposing the model has learned the distribution, we can draw samples from the latent space L, with dim(L) = embedding_dim.
Every point in L will result in a prediction, and here we meet our first problem: the latent space is continuous, so there is an infinitely large number of samples we could draw, and it is more than unlikely that they will all result in something usable.
But there will be regions that do yield good results; these are the ones that we get from the encoder.
We need a way to somehow simulate an encoder output, or to know where it lies.
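As a rough sketch of what "drawing a point from L and decoding it" means, reusing the decoder and embedding_dim from the question (the sampling range is an arbitrary assumption, and most of these points will indeed decode into rubbish):
import numpy as np

# Draw a few arbitrary points from the latent space L and decode them.
# Nothing forces these points into a region the encoder actually produces,
# so the resulting "samples" are usually unusable.
z_arbitrary = np.random.uniform(-100, 100, size=(10, embedding_dim))
rubbish = decoder.predict(z_arbitrary)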
NOTE: The following sections are more important for categorical features; with a single distribution we should get rather continuous results in one region and not multiple clusters.
With activation functions and BatchNormalization layers we can shrink the values down to a reasonable range. Activation functions also add the non-linearity our model needs in order to represent non-linear functions.
If we want to generate outputs, we need to avoid "blank" space in the latent space that decodes into rubbish. With regularization we can bring useful areas closer together and enlarge them. This is again a trade-off with quality, but regularization also helps against overfitting and decreases the weights, which again shrinks down the space of possible values. Regularization is one of the most important ingredients for generating a latent space that can be used as a sample space.
(Source of the image and also good article about VAE and latent space: Understanding Variational Autoencoders)
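As a minimal illustration of this pattern (the full encoder below uses the same idea; the layer size and regularization strength here are arbitrary assumptions):
from tensorflow import keras
from tensorflow.keras import layers

# Sketch: the activity regularizer penalizes large, spread-out activations,
# and BatchNormalization keeps them in a reasonable range.
x_in = keras.Input(shape=(1,))
h = layers.Dense(64, activation="relu",
                 activity_regularizer=keras.regularizers.l2(1e-3))(x_in)
h = layers.BatchNormalization()(h)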
Let's say we managed to shrink the values down to an extreme range such as [-1, 1].
The sample space will still have the size of [-1, 1]^embedding_dim, which can be quite large depending on its dimension!
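To get a feeling for how quickly that space grows, here is a hypothetical back-of-the-envelope illustration (the grid resolution of 10 points per axis is an arbitrary assumption):
# Covering [-1, 1]^d with a coarse grid of only 10 points per axis
# already needs 10**d points, so useful regions become tiny as d grows.
for d in (1, 2, 10, 42):
    print(f"d={d}: {10 ** d:g} grid points")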
Here we need some trade-off: in short, the latent dimension of a variational autoencoder should be as low as possible, but how low depends on the setting.
In theory, we can think of the latent space as holding the latent variables of the input/model, from which the data can then be reconstructed.
For a normal distribution we would think of 2 variables, right? Mean and variance. So should we choose embedding_dim = 2?
Rather no, embedding_dim = 1 should be enough.
The decoder has the potential to generalize the output via the bias term of the layer, so the dimension of the latent space can be smaller than the true number of latent variables, BUT the generated outputs could lack variation.
In case of a normal distribution or others where the mean is constant, we can expect the decoder to learn the mean.
I did some research in that direction as well.
Some other sources:
The VAE that I created here is based on these two tutorials:
Most important changes:
Outputs have no activation function, as the data distribution was taken as is.
As there is no preprocessing like normalization, the network needs to be deeper. With even more layers and tweaks we could normalize the output of the encoder, but nicer input data has a much stronger effect.
Therefore the cross-entropy loss was exchanged for mean squared error, to handle arbitrarily large outputs.
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt
# Your distribution
latent_dim = 1
data = np.random.normal(100, 10, 100) # generate 100 numbers
data_train, data_test = data[:-33], data[-33:]
# Note I took the distribution raw, some preprocessing should help!
# Like normalizing it and later apply on the output
# to get the real distribution back
class Sampling(layers.Layer):
    """Uses (z_mean, z_log_var) to sample z, the vector encoding the input."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        batch = tf.shape(z_mean)[0]
        dim = tf.shape(z_mean)[1]
        epsilon = tf.keras.backend.random_normal(shape=(batch, dim))
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon
latent_dim = 1
# =============================================================================
# Encoder
# There are many valid configurations of hyperparameters,
# Here it is also doable without Dropout, regularization and BatchNorm
# =============================================================================
encoder_inputs = keras.Input(shape=(1,))
x = layers.BatchNormalization()(encoder_inputs)
x = layers.Dense(200, activation="relu", activity_regularizer="l2")(x)
x = tf.keras.layers.Dropout(0.1)(x)
x = layers.Dense(200, activation="relu", activity_regularizer="l2")(x)
x = layers.BatchNormalization()(x)
x = layers.Dense(50, activation="relu", activity_regularizer="l2")(x)
# Splitting into mean and variance
z_mean = layers.Dense(latent_dim, name="z_mean", activity_regularizer="l2")(x)
z_mean = layers.BatchNormalization()(z_mean)
z_log_var = layers.Dense(latent_dim, activation="relu", name="z_log_var")(x)
z_log_var = layers.BatchNormalization()(z_log_var)
# Create the sampling layer
z = Sampling()([z_mean, z_log_var])
encoder = keras.Model(encoder_inputs, [z_mean, z_log_var, z], name="encoder")
# =============================================================================
# Decoder
# Contrary to other architectures we don't aim for a categorical output
# in a range of 0...Y, so we use linear activation at the end
# NOTE: Normalizing the training data allows the use of other functions
# but I did not test that.
# =============================================================================
latent_inputs = keras.Input(shape=(latent_dim,))
x = layers.Dense(50, activation="relu")(latent_inputs)
x = layers.Dense(200, activation="relu")(x)
x = layers.Dense(200, activation="relu")(x)
x = layers.Dense(200, activation="linear")(x)
x = layers.Dense(1, activation="linear")(x)
decoder = keras.Model(latent_inputs, x, name="decoder")
# =============================================================================
# Create a model class
# =============================================================================
class VAE(keras.Model):
    def __init__(self, encoder, decoder, **kwargs):
        super(VAE, self).__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder
        self.total_loss_tracker = keras.metrics.Mean(name="total_loss")
        self.reconstruction_loss_tracker = keras.metrics.Mean(
            name="reconstruction_loss"
        )
        self.kl_loss_tracker = keras.metrics.Mean(name="kl_loss")

    @property
    def metrics(self):
        return [
            self.total_loss_tracker,
            self.reconstruction_loss_tracker,
            self.kl_loss_tracker,
        ]

    @tf.function
    def sample(self, amount=None, eps=None):
        if eps is None:
            eps = tf.random.normal(shape=(amount or 50, latent_dim))
        return self.decode(eps, apply_sigmoid=False)

    def encode(self, x):
        mean, logvar, z = self.encoder(x)
        return mean, logvar, z

    def reparameterize(self, mean, logvar):
        eps = tf.random.normal(shape=mean.shape)
        return eps * tf.exp(logvar * .5) + mean

    def decode(self, z, apply_sigmoid=False):
        logits = self.decoder(z)
        if apply_sigmoid:
            probs = tf.sigmoid(logits)
            return probs
        return logits

    def train_step(self, data):
        with tf.GradientTape() as tape:
            z_mean, z_log_var, z = self.encode(data)
            #z = self.reparameterize(z_mean, z_log_var)
            reconstruction = self.decoder(z)
            reconstruction_loss = tf.reduce_sum(keras.losses.mean_squared_error(data, reconstruction))
            kl_loss = -0.5 * (1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
            kl_loss = tf.reduce_sum(kl_loss, axis=1)
            total_loss = reconstruction_loss + kl_loss
        grads = tape.gradient(total_loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        self.total_loss_tracker.update_state(total_loss)
        self.reconstruction_loss_tracker.update_state(reconstruction_loss)
        self.kl_loss_tracker.update_state(kl_loss)
        return {
            "loss": self.total_loss_tracker.result(),
            "reconstruction_loss": self.reconstruction_loss_tracker.result(),
            "kl_loss": self.kl_loss_tracker.result(),
        }
# =============================================================================
# Training
# EarlyStopping is strongly recommended here
# but sometimes gets stuck early
# Increase the batch size if there are more samples available!
# =============================================================================
vae = VAE(encoder, decoder)
callback = tf.keras.callbacks.EarlyStopping(monitor='loss',
patience=10,
restore_best_weights=False)
vae.compile(optimizer=keras.optimizers.Adam())
vae.fit(data_train, epochs=100, batch_size=11, callbacks=[callback])
"""
Last Epoch 33/100
7/7 [===] - 2ms/step
- loss: 2394.6672
- reconstruction_loss: 1130.7889
- kl_loss: 1224.3684
"""
encoded_train = encoder.predict(data_train)
plt.hist(data_train, alpha=0.5, label="Train")
plt.hist(decoder.predict(encoded_train).flatten(), alpha=0.75, label="Output")
plt.legend()
plt.show()
encoded = encoder.predict(data_test)
#print(encoded)
plt.hist(data_test, alpha=0.5, label="Test")
plt.hist(decoder.predict(encoded).flatten(), label="Output", alpha=0.5)
plt.legend()
plt.show()
Everything is shifted a bit to the left.
The mean was not learned ideally, but nearly perfectly.
As mentioned above, now comes the tricky part: how to sample from our latent space.
Ideally the latent space would be centered around 0 and we could sample from a normal distribution.
But as we still have our training data, we can check their encoding:
>>> encoded_train[0].mean()
-43.1251
>>> encoded_train[0].std()
4.4563518
These numbers could be arbitrary but it's nice to see that the std is rather low.
Let's plug these in and compare 15000 real vs. 15000 generated samples:
sample = vae.sample(eps=tf.random.normal((15000, latent_dim),
encoded_train[0].mean(axis=0),
encoded_train[0].std(axis=0))).numpy()
plt.hist(np.random.normal(100, 10, 15000), alpha=0.5, label="Real Distribution", bins=20)
plt.hist(sample,
alpha=0.5, label="Sampled", bins=20)
plt.legend()
plt.show()
Looks very good, doesn't it?
>>> sample.std()
10.09742
>>> sample.mean()
97.27115
Very close to the original distribution.
Note that these results are somewhat empirical and, due to randomness and early stopping, not always consistent, BUT increasing the latent space will gradually make it harder to generate good samples.
As you can see, the mean still works well, but we lack variance; we need to upscale it and need a better estimate for it.
I'm a bit surprised that upscaling the variance really works.
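For instance, a hedged sketch of what that upscaling could look like, reusing the sampling call from above (the factor 1.5 is just an assumption to be tuned):
# Widen the std used for latent sampling to recover some of the missing
# output variance; the scaling factor is purely empirical.
scale = 1.5
wider_sample = vae.sample(eps=tf.random.normal((15000, latent_dim),
                                               encoded_train[0].mean(axis=0),
                                               scale * encoded_train[0].std(axis=0))).numpy()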
Compared to, for example, the MNIST digits, where multiple clusters exist inside the latent space that generate particularly good outputs, here there is only one, and with the estimate from the training data we even know where it is.
Adding some prior to the mean and variance should further improve the results (at the cost of bias).
Upvotes: 4
Reputation: 788
I understand you're trying to use an auto-encoder to learn a data distribution which will then allow you to create new samples out of this distribution.
First question: An auto-encoder projects your features to a latent space while learning non-linear relationships between these features.
In your case, your random samples don't have any underlying n-dimensional structure, so projecting your datapoints to a space of size embedding_dim
won't provide you with good results. It would be the same for a PCA. The decoder part won't be able to recreate data without a big loss.
I would recommend performing your test on more meaningful data to be able to evaluate such a model. Then, choosing embedding_dim is a matter of capturing the non-linear interactions in your input dimensions, with the risk of overfitting if embedding_dim is too high.
Second question: Once you have trained your AE, one solution is to give the decoder a random input of values between 0 and 1. Then, it will give you new samples from the learned distribution.
However, you have no guarantee these generated samples are representative of the original data because you would need to sample the right part of the distribution. This would require you to have an approach to input carefully selected values as inputs for your decoder.
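A minimal sketch of that idea, assuming the decoder and embedding_dim defined in the question (the number of samples is arbitrary, and nothing guarantees these latent values hit a useful region):
import numpy as np

# Feed random latent vectors with values in [0, 1] to the trained decoder
# to obtain new, unvalidated samples from the learned mapping.
random_latents = np.random.uniform(0, 1, size=(100, embedding_dim))
new_samples = decoder.predict(random_latents)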
Note: I would add that you should take a look at Variational Autoencoders, which have better properties for capturing distributions.
Resources:
Upvotes: 1