Reputation: 23
I feed conv2d with the same data but a different batch size (by repeating the sample along the batch dimension) as input:
a = torch.rand(1, 512, 16, 16) # (1, 512, 16, 16)
b = torch.cat([a, a, a], dim=0) # (3, 512, 16, 16)
a, b = a.cuda(), b.cuda()
net = nn.Conv2d(512, 1024, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
net = net.cuda()
ay = net(a)
by = net(b)
print('ay[0], by[0] max diff', torch.max(torch.abs(ay[0] - by[0])).item())
print('ay[0], by[0] allclose', torch.allclose(ay[0], by[0]))
The results, however, are different:
ay[0], by[0] max diff 3.5762786865234375e-06
ay[0], by[0] allclose False
This was tested on Linux + V100 + torch 1.9.0 + cu111, but I have seen the same problem on many other configurations as well. Any clue why? Or do I simply misunderstand how conv2d should work?
I ran into this problem when validating my training-set results with batch size 1: the error was noticeably different from the one I recorded during training, so I looked into it and found that the conv2d layer is where the discrepancy comes from. If I understand conv2d correctly, this should not be happening.
Upvotes: 2
Views: 383
Reputation: 583
As far as I know, the problem is not specific to conv2d operations, but is rather due to limited floating-point precision: the result can vary depending on the order of operations and on the architecture. This is a known issue, see e.g. this discussion on the PyTorch forum.
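To see that this is a property of floating-point arithmetic rather than of conv2d, here is a minimal sketch (the exact numbers are hardware- and version-dependent) showing that merely changing the grouping of the same single-precision additions changes the result:
import torch

x = torch.rand(100000)                        # the same single-precision numbers
s_full  = x.sum()                             # summed in one reduction
s_split = x[:50000].sum() + x[50000:].sum()   # summed in two halves, then combined
# The two results usually differ in the last bits because float32 addition
# is not associative; the exact size of the difference is hardware-dependent.
print('difference:', (s_full - s_split).abs().item())
Batched and unbatched conv2d can pick different kernels and accumulation orders, which is why ay[0] and by[0] end up differing in the last few bits.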
The GPU calculations you are currently running probably use single-precision floats; if you switch to double precision, the discrepancies should shrink:
torch.set_default_tensor_type(torch.DoubleTensor)
meaning that:
print('ay[0], by[0] allclose', torch.allclose(ay[0], by[0], atol=1e-6))
Should print:
ay[0], by[0] allclose True
At least this was the case for me when testing, also on Linux, but using an A100.
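For completeness, a sketch of how the original comparison could be rerun in double precision. Note that set_default_tensor_type only affects tensors and modules created after the call; converting the existing tensors and module with .double() would be an equivalent option:
import torch
import torch.nn as nn

# Must be set before the tensors and the module are created.
torch.set_default_tensor_type(torch.DoubleTensor)

a = torch.rand(1, 512, 16, 16)
b = torch.cat([a, a, a], dim=0)
a, b = a.cuda(), b.cuda()

net = nn.Conv2d(512, 1024, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1)).cuda()
ay = net(a)
by = net(b)
print('ay[0], by[0] max diff', torch.max(torch.abs(ay[0] - by[0])).item())
# Expected to print True in double precision (it did in my A100 test).
print('ay[0], by[0] allclose', torch.allclose(ay[0], by[0], atol=1e-6))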
Upvotes: 2