Phillip Bock

Reputation: 1889

Should batch size matter at inference?

I am training a model

As there are 7 equally distributed classes, the random-baseline classification accuracy would be 14 % (1/7 is roughly 14 %). Actual accuracy is 40 %, so the net does learn something.

Now the weird thing is that it only learns with a batch size of 2. With batch sizes of 16, 32 or 64 it doesn't learn at all.

Now the even weirder thing: if I take the checkpoint of the trained net (accuracy 40 %, trained at batch size 2) and restart it with a batch size of 32, I should keep getting my 40 % at least for the first couple of steps, right? I do when I restart at batch size 2. But with batch size 32 the initial accuracy is, guess what, 14 %.

Any idea why the batch size would ruin inference? I fear I might have a shape error somewhere, but I cannot find anything.

Thx for your thoughts

Upvotes: 3

Views: 9208

Answers (3)

SomethingSomething

Reputation: 12196

There are two possible modes of operation for a batch normalization layer at inference time:

  1. Compute the activation mean and variance over the given inference batch
  2. Use the running mean and variance accumulated over the training batches

In PyTorch, for example, the track_running_stats parameter of the BatchNorm2d layer defaults to True; in other words, Option 2 is the default.


If you choose Option 1, then of course, the size of the inference batch and the characteristics of each sample in it will affect the outputs of the other samples.

So γ and β are learned during training and used as-is at inference, and if you do not change the default behavior, the "same" is true for E[x] and Var[x]. I put "same" in quotes on purpose, as these are just running statistics accumulated from the training batches.
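As a minimal sketch (the layer width and input shapes here are made up for illustration), this is how the two modes behave in PyTorch: in train mode a sample's output depends on the rest of the batch, while in eval mode the stored running statistics are used and the batch size no longer matters:

    import torch
    import torch.nn as nn

    bn = nn.BatchNorm2d(8)          # track_running_stats=True by default
    x = torch.randn(32, 8, 4, 4)

    # Train mode: normalization uses the statistics of the current batch,
    # so a sample's output depends on the other samples in the batch.
    bn.train()
    _ = bn(x)

    # Eval mode: normalization uses the running estimates of E[x] and Var[x]
    # accumulated during training, so the inference batch size is irrelevant.
    bn.eval()
    out_full = bn(x)        # batch of 32
    out_single = bn(x[:1])  # batch of 1

    # In eval mode the first sample gets the same output regardless of batch size.
    print(torch.allclose(out_full[:1], out_single, atol=1e-6))  # True

If the evaluation code never calls model.eval() (or explicitly disables the running statistics), the statistics of the inference batch are used instead, which would explain accuracy that changes with the inference batch size.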

Since we're already talking about batch size, I'd mention that it may sound tempting to use very large batch sizes in training, to get more accurate statistics and a better approximation of the loss function for each SGD step. Yet approximating the loss function too well has drawbacks, such as overfitting.

Upvotes: 2

David Wong

Reputation: 748

You should look at the accuracy once your model has converged, not while it is still training. It's hard to compare the effects of different batch sizes during training because a run can get "lucky" and follow a good gradient path. In general, a smaller batch size tends to be noisier and can give you good peaks and bad drops in accuracy.

Upvotes: 1

Gregory Begelman

Reputation: 554

It's hard to tell without looking at the code, but I think that large batch sizes cause the gradient to be too large, so the training cannot converge. One way to fight this effect is to increase the batch size but also decrease the learning rate. You can also try clipping the gradient magnitude.
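A minimal sketch of gradient clipping in a PyTorch training step; the tiny 7-class model, optimizer and train_step helper are placeholders standing in for the asker's actual code:

    import torch
    import torch.nn as nn

    # Placeholder model and optimizer, just to show where the clipping call goes.
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 7))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    def train_step(inputs, targets):
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        # Clip the global gradient norm so a large batch cannot produce
        # an oversized update in a single step.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        return loss.item()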

Upvotes: 0
