Reputation: 21
I have an image dataset with three splits: training, validation, and test. I want to normalize the dataset to make training easier, so I need to find the mean and standard deviation of the RGB values from the available data.
The doubt I have is - should I consider all the splits for normalizing?
My personal thought is that only the training split should be used since it is assumed to be the only data that we have to train the model. The model is then provided inputs from the training distribution, and any remaining errors can be picked up by evaluation on the validation split. If I compute the distribution from data outside the training split, wouldn't that be feeding the network extra information beyond what it is supposed to learn from?
Any other way to do this would also be of help. For example, is it just better to use standard values for RGB?
transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
(Source: PyTorch Torchvision Transforms)
Upvotes: 0
Views: 1954
Reputation: 2428
only the training split should be used since it is assumed to be the only data that we have to train the model
Correct. And don't forget to scale the validation and test sets using the mean and standard deviation of the training set, rather than their own statistics. Otherwise you introduce a domain shift between training and evaluation.
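A minimal NumPy sketch of that idea (the arrays here are random data standing in for real image splits; shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Fake splits: N x H x W x 3 arrays of RGB values in [0, 1]
train = rng.random((100, 8, 8, 3))
val = rng.random((20, 8, 8, 3))

# Per-channel statistics from the TRAINING split only
mean = train.mean(axis=(0, 1, 2))   # shape (3,)
std = train.std(axis=(0, 1, 2))

# Both splits are scaled with the training statistics --
# the validation split does NOT use its own mean/std.
train_norm = (train - mean) / std
val_norm = (val - mean) / std
```

After this, the training split is exactly zero-mean/unit-variance per channel, while the validation split is only approximately so, which is the intended behavior.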
is it just better to use standard values for RGB
Results will be slightly better or slightly worse, but probably not very different, provided everything else (learning rate, weight initialization) is well tuned.
Upvotes: 1
Reputation: 461
The doubt I have is - should I consider all the splits for normalizing?
As you said, in theory you should only use the training data for anything that involves fitting, including the normalization statistics.
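For instance, the training-set statistics can be accumulated batch by batch, so the whole split never has to sit in memory at once (a sketch; the `N x H x W x 3` batch layout is an assumption):

```python
import numpy as np

def channel_stats(batches):
    """Accumulate per-channel mean/std over an iterable of
    N x H x W x 3 batches, without holding them all in memory."""
    n = 0
    total = np.zeros(3)
    total_sq = np.zeros(3)
    for b in batches:
        n += b.shape[0] * b.shape[1] * b.shape[2]   # pixels per channel
        total += b.sum(axis=(0, 1, 2))
        total_sq += (b ** 2).sum(axis=(0, 1, 2))
    mean = total / n
    std = np.sqrt(total_sq / n - mean ** 2)          # population std
    return mean, std
```

The same pattern works with a PyTorch `DataLoader` iterating over the training split.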
Any other way to do this would also be of help. For example, is it just better to use standard values for RGB?
In practice, probably yes. In fact, it shouldn't really matter much how you normalize your data; you could even go for mean=0.5, std=0.5 for each channel, or adopt a -127/+127 range. The network should adapt to whatever input distribution you provide during training.
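As an illustration, mean=0.5, std=0.5 simply maps pixel values from [0, 1] to [-1, 1]:

```python
import numpy as np

x = np.array([0.0, 0.5, 1.0])   # pixel values after scaling to [0, 1]
y = (x - 0.5) / 0.5             # Normalize with mean=0.5, std=0.5
# y is now [-1.0, 0.0, 1.0]
```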
What you should probably bear in mind is practical use and application: if you're dealing with pretrained networks, they usually come with ImageNet normalization (the values you quoted). This is common practice, since the pretrained weights expect inputs distributed like the data they were originally trained on.
TLDR: the choice between custom and "standard" normalization depends on the task itself. In practice, normalization shouldn't matter very much, and you should be fine either way. Do you have a decently sized training set and time to compute some statistics? Go for custom values. Short on time, or is the dataset quite small? It's probably safer to go with the ImageNet values.
Upvotes: 1