Reputation: 23
I feed conv2d with the same data but a different batch size (by repeating the sample along the batch dimension) as input:
a = torch.rand(1, 512, 16, 16) # (1, 512, 16, 16)
b = torch.cat([a, a, a], dim=0) # (3, 512, 16, 16)
a, b = a.cuda(), b.cuda()
net = nn.Conv2d(512, 1024, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
net = net.cuda()
ay = net(a)
by = net(b)
print('ay[0], by[0] max diff', torch.max(torch.abs(ay[0] - by[0])).item())
print('ay[0], by[0] allclose', torch.allclose(ay[0], by[0]))
The results, however, are different:
ay[0], by[0] max diff 3.5762786865234375e-06
ay[0], by[0] allclose False
This was tested on Linux + V100 + torch 1.9.0 + cu111, but I have seen the same problem on many other configurations as well. Any clue why? Or do I simply misunderstand how conv2d should work?
I ran into this problem when validating my training-set results with batch size 1: the error was noticeably different from the one I recorded during training, so I looked into it and found that the conv2d layer is where the discrepancy comes from. If I understand conv2d correctly, this should not be happening.
Upvotes: 2
Views: 383
Reputation: 583
As far as I know, the problem is not specific to conv2d operations, but is rather due to limited floating-point precision: the result can vary depending on the order of operations and on the architecture. This is a known issue, see e.g. this discussion on the PyTorch forum.
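To see that this is a property of floating-point arithmetic rather than of conv2d, here is a minimal sketch (the exact numbers are hardware- and version-dependent) showing that merely changing the grouping of the same single-precision additions changes the result:
import torch

x = torch.rand(100000)                        # the same single-precision numbers
s_full  = x.sum()                             # summed in one reduction
s_split = x[:50000].sum() + x[50000:].sum()   # summed in two halves, then combined
# The two results usually differ in the last bits because float32 addition
# is not associative; the exact size of the difference is hardware-dependent.
print('difference:', (s_full - s_split).abs().item())
Batched and unbatched conv2d can pick different kernels and accumulation orders, which is why ay[0] and by[0] end up differing in the last few bits.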
The GPU calculations you are currently running probably use single-precision floats; if you switch to double precision, the discrepancies should shrink:
torch.set_default_tensor_type(torch.DoubleTensor)
meaning that:
print('ay[0], by[0] allclose', torch.allclose(ay[0], by[0], atol=1e-6))
Should print:
ay[0], by[0] allclose True
At least this was the case for me when testing, also on Linux, but using an A100.
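For completeness, a sketch of how the original comparison could be rerun in double precision. Note that set_default_tensor_type only affects tensors and modules created after the call; converting the existing tensors and module with .double() would be an equivalent option:
import torch
import torch.nn as nn

# Must be set before the tensors and the module are created.
torch.set_default_tensor_type(torch.DoubleTensor)

a = torch.rand(1, 512, 16, 16)
b = torch.cat([a, a, a], dim=0)
a, b = a.cuda(), b.cuda()

net = nn.Conv2d(512, 1024, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1)).cuda()
ay = net(a)
by = net(b)
print('ay[0], by[0] max diff', torch.max(torch.abs(ay[0] - by[0])).item())
# Expected to print True in double precision (it did in my A100 test).
print('ay[0], by[0] allclose', torch.allclose(ay[0], by[0], atol=1e-6))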
Upvotes: 2