Astarno

Reputation: 323

PyTorch - How should you normalize individual instances?

I am using PyTorch to train a linear regression model. I trained this model on a dataset of 200 drawings, each represented by several features. Because the features operate on different scales, I decided to normalize my training data in order to get better results. The labels associated with these drawings indicate how well they are perceived by the public. This all went well, and I already have a fairly consistent model, considering I only had a training set of 200 drawings. See my code below for more details:

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.preprocessing import MinMaxScaler

# Reading the data
n = 200  # number of drawings in the training set
data = pd.read_csv('dataset.csv')
drawings = data.iloc[:n, 1:]
labels = data.iloc[:n, 0]

# Making sure it's in the right format
drawings_numpy = drawings.values.astype(np.float32)
labels_numpy = labels.values.astype(np.float32)
labels_numpy = labels_numpy.reshape(-1,1)

# Normalizing
scaler = MinMaxScaler()
drawings_numpy = scaler.fit_transform(drawings_numpy)

# Converting to tensors
inputs = torch.tensor(drawings_numpy)
targets = torch.tensor(labels_numpy)

# Loading it into the model
input_size = inputs.shape[1]
output_size = 1
model = nn.Linear(input_size, output_size)

My code then continues with defining the loss, the optimizer, and the training loop, but I suppose the part above is the most relevant to this question. After having trained and saved my model, I would now obviously like to use it to predict the labels of new drawings. However, and correct me if I'm wrong, it seems to me that I would have to normalize any drawing I now provide to the model the same way I normalized the original training set, correct? If so, I'm no expert on how this normalization works exactly, but I assume the way data gets normalized depends on how the data behaves (e.g. the minimum and maximum value a single feature can take in the dataset). If that is the case, I feel like I cannot normalize the single instances I now want predictions for by simply calling the same functions I used on my training set. Could somebody shed some light on how this works exactly, or on whether I'm making a mistake in my reasoning?
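
For example, if my reasoning is right, the scaled value of a new instance depends entirely on the min/max of the data the scaler was fit on (toy arrays here, not my real data):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy training data: feature 0 ranges over [1, 3], feature 1 over [10, 30]
X_train = np.array([[1., 10.], [2., 20.], [3., 30.]], dtype=np.float32)

scaler = MinMaxScaler()
scaler.fit(X_train)

# A new instance gets scaled relative to the training min/max, i.e.
# (x - X_train.min(axis=0)) / (X_train.max(axis=0) - X_train.min(axis=0))
print(scaler.transform(np.array([[2., 25.]], dtype=np.float32)))  # [[0.5  0.75]]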

Upvotes: 1

Views: 2630

Answers (1)

dumbPy

Reputation: 1518

You are correct about this. The scaling depends on how the data behaves in a given feature, i.e. its distribution, or in this case just its min/max values.
Since a single test instance is not a good representation of the underlying distribution, but the training data is (assumed to be, and should be), you save the fitted scaler's parameters for future use.
I would suggest going through the documentation of MinMaxScaler and the other scalers here.

The learned statistics are stored as attributes on the fitted scaler (for MinMaxScaler these include data_min_, data_max_ and scale_), so you can persist the fitted scaler itself (e.g. with pickle or joblib) and call its transform method during inference, instead of fitting a new scaler on the test data.
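
As a minimal sketch (with placeholder data and a placeholder 5-feature linear model, since I don't know your exact setup), it could look like this:

import joblib
import numpy as np
import torch
import torch.nn as nn
from sklearn.preprocessing import MinMaxScaler

# Placeholder stand-ins for your training data and trained model
drawings_numpy = np.random.rand(200, 5).astype(np.float32)
model = nn.Linear(5, 1)

# Training time: fit the scaler on the training data only, then persist it
scaler = MinMaxScaler()
drawings_scaled = scaler.fit_transform(drawings_numpy)
joblib.dump(scaler, 'scaler.joblib')

# Inference time: load the same fitted scaler and only call transform (never fit again)
scaler = joblib.load('scaler.joblib')
new_drawing = np.random.rand(1, 5).astype(np.float32)  # one new instance, same feature order
new_drawing_scaled = scaler.transform(new_drawing)

with torch.no_grad():
    prediction = model(torch.tensor(new_drawing_scaled))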

Upvotes: 2
