Dave

Reputation: 564

Questions regarding a simple autoencoder implementation

I have the following simple autoencoder that I created for dimensionality reduction of my data. The input data contains 10K samples of integer values, and the class label is either 0 or 1:

import numpy as np
import pandas as pd
from keras import Model, Input
from keras.layers import Dense
from sklearn.model_selection import train_test_split


def construct_network(X_train):
    input_dim = X_train.shape[1]
    neurons = 64
    input_layer = Input(shape=(input_dim,))
    encoded1 = Dense(neurons, activation='relu')(input_layer)
    encoded = Dense(int(neurons / 2), activation='relu')(encoded1)
    decoded1 = Dense(neurons, activation='relu')(encoded)
    output_layer = Dense(input_dim, activation='linear')(decoded1)

    autoencoder = Model(inputs=input_layer, outputs=output_layer)
    return autoencoder

# read_data is my own helper that returns the feature matrix and the labels
data, labels = read_data('/Users/A/datasets/data.csv')
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2)
autoencoder = construct_network(X_train)
autoencoder.compile(optimizer='adam', loss='mse', metrics=['acc'])
history = autoencoder.fit(X_train, X_train,
                          epochs=100,
                          batch_size=64,
                          validation_split=0.2,
                          use_multiprocessing=True)
y_pred = autoencoder.predict(X_test, use_multiprocessing=True)
mse_per_sample = np.mean(np.power(X_test - y_pred, 2), axis=1)
error = pd.DataFrame({'error': mse_per_sample, 'true_label': y_test})
print(error)

I have two questions:

  1. Is the choice of loss='mse' suitable for this problem?
  2. How can I calculate the percentage of correctly predicted values from mse_per_sample and y_test in the last line, error = pd.DataFrame({'error': mse_per_sample, 'true_label': y_test})?

Thank you

Upvotes: 2

Views: 167

Answers (1)

R-Strange

Reputation: 180

I'll start with the second question and use it to explain the first. An autoencoder takes a tensor of input values, reduces its dimensionality, and then approximates the input again from the information it has left. Because it is approximating a quantitative target rather than a qualitative label, it needs to regress those values.

The implication of this is that we can't simply group predictions into a "right" and a "not right" bucket; instead we measure how closely our values match the target values. If we only had "right" and "wrong" we wouldn't learn how close to correct we are - for a target of 22, a prediction of 21.963 would be just as "wrong" as 1.236. Furthermore, your regressed values will very rarely land exactly on the right value, so a right/wrong split doesn't capture the performance of the model well.

So if there is no simple right and wrong, how do we measure the performance of the model? We look at the distance between the predicted and actual values and use it as the error of each prediction. Taking the average of the absolute errors gives us our first metric - Mean Absolute Error (MAE). This is an L1 measurement, but it's often quite choppy, so we want a smoother measurement. By squaring the errors before averaging we get Mean Squared Error (MSE), which behaves more predictably and is the standard regression loss function. (Honourable mention to Mean Squared Log Error (MSLE or MSLogE), which squares the difference of the logs of the values.)
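As a rough illustration (not part of the question's code), the three metrics can be computed per sample with numpy, assuming X_test and y_pred are arrays of the same shape; MSLE additionally assumes non-negative values:

import numpy as np

# Per-sample error metrics; rows are samples, columns are features.
mae_per_sample = np.mean(np.abs(X_test - y_pred), axis=1)     # L1 error
mse_per_sample = np.mean(np.square(X_test - y_pred), axis=1)  # L2 error
msle_per_sample = np.mean(np.square(np.log1p(X_test) - np.log1p(y_pred)), axis=1)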

MSE is your go-to, but it works best when the errors are roughly Gaussian. MSLogE behaves similarly but handles large target values better, and MAE is more robust when the distribution is only semi-Gaussian (heavy-tailed or with outliers). That being said, if you standardise or normalise your input you should usually end up with a roughly Gaussian distribution anyway.
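If you want to standardise the input, one common option is scikit-learn's StandardScaler (a sketch, not part of the original code; MinMaxScaler is the normalising alternative):

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training split only, then apply it to both splits,
# so information from the test set does not leak into training.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)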

If you must have an "accuracy" statistic, decide on your acceptable level of error and create a filter mask in your DataFrame for values above and below that threshold. Then it's simply a matter of dividing the number of values below the threshold by the total number of values, as in the sketch below.
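A minimal sketch, reusing the error DataFrame from the question; the threshold value itself is a hypothetical number you would tune for your data:

# Threshold-based "accuracy": choose the threshold based on what level of
# reconstruction error you consider acceptable.
threshold = 0.1
within_tolerance = error['error'] < threshold   # boolean mask per test sample
accuracy = within_tolerance.mean()              # fraction of samples below the threshold
print(f'{accuracy:.2%} of test samples are within the error tolerance')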

Upvotes: 2
