Kevin Sheng

Reputation: 61

Why does pytorch matmul get different results when executed on cpu and gpu?

I am trying to figure out the rounding differences between numpy/pytorch, gpu/cpu, and float16/float32 numbers, and what I'm finding confuses me.

The basic version is:

import torch

a = torch.rand(3, 4, dtype=torch.float32)
b = torch.rand(4, 5, dtype=torch.float32)
print(a.numpy()@b.numpy() - a@b)

The result is all zeros, as expected. However,

print((a.cuda()@b.cuda()).cpu() - a@b)

gets non-zero results. Why is PyTorch float32 matmul executed differently on the GPU and the CPU?

An even more confusing experiment involves float16, as follows:

a = torch.rand(3, 4, dtype=torch.float16)
b = torch.rand(4, 5, dtype=torch.float16)
print(a.numpy()@b.numpy() - a@b)
print((a.cuda()@b.cuda()).cpu() - a@b)

Both of these results are non-zero. Why are float16 numbers handled differently by numpy and torch? I know the CPU can only do float32 operations and numpy converts float16 to float32 before computing, but the torch calculation is also executed on the CPU.

And guess what, print((a.cuda()@b.cuda()).cpu() - a.numpy()@b.numpy()) gives an all-zero result! This is pure fantasy to me...
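
A sketch of how the float32-promotion guess can be checked explicitly (the astype round trip below is only an illustration of the hypothesis, not something numpy is documented to do internally):

import numpy as np
import torch

a = torch.rand(3, 4, dtype=torch.float16)
b = torch.rand(4, 5, dtype=torch.float16)

# float16 matmul as numpy performs it
direct = a.numpy() @ b.numpy()
# explicit promotion to float32, matmul, then demotion back to float16
promoted = (a.numpy().astype(np.float32) @ b.numpy().astype(np.float32)).astype(np.float16)
print((direct - promoted).any())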

The environment is as follows:


On the advice of some of the commenters, I added the following equality tests:

(a.numpy()@b.numpy() - (a@b).numpy()).any()                      # float32, numpy vs torch CPU
((a.cuda()@b.cuda()).cpu() - a@b).numpy().any()                  # float32, GPU vs CPU
(a.numpy()@b.numpy() - (a@b).numpy()).any()                      # float16, numpy vs torch CPU
((a.cuda()@b.cuda()).cpu() - a@b).numpy().any()                  # float16, GPU vs CPU
((a.cuda()@b.cuda()).cpu().numpy() - a.numpy()@b.numpy()).any()  # float16, GPU vs numpy

each placed directly after the corresponding one of the five print statements above. The results are:

False
True
True
True
False

As for the last one, I've tried it several times, and I think I can rule out luck.

Upvotes: 6

Views: 3320

Answers (1)

hkchengrex

Reputation: 4826

The differences are mostly numerical, as mentioned by @talonmies. The CPU and GPU backends and their respective BLAS libraries are implemented differently and use different operations and orders of operation, hence the numerical differences.

One possible cause is sequential accumulation vs. tree reduction (https://discuss.pytorch.org/t/why-different-results-when-multiplying-in-cpu-than-in-gpu/1356/3); e.g. (((a+b)+c)+d) has different numerical properties than ((a+b)+(c+d)).
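
As a standalone illustration (not from the linked thread), the following sketch shows that float32 addition is not associative, so the two groupings above can round differently:

import numpy as np

rng = np.random.default_rng(0)
mismatches = 0
for _ in range(10000):
    a, b, c, d = rng.random(4).astype(np.float32)
    sequential = ((a + b) + c) + d   # left-to-right accumulation
    pairwise = (a + b) + (c + d)     # tree-style reduction
    if sequential != pairwise:
        mismatches += 1
print(f'{mismatches} / 10000 draws gave different float32 sums')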

This question also mentions fused operations (multiply-add), which can cause numerical differences.
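
To illustrate the fused-multiply-add point, here is a rough sketch that emulates the single rounding of an FMA by computing in float64 and rounding once at the end (not exactly a hardware FMA, but close enough to show the effect):

import numpy as np

rng = np.random.default_rng(0)
mismatches = 0
for _ in range(10000):
    a, b, c = rng.random(3).astype(np.float32)
    # separate ops: round after the multiply, then again after the add
    separate = np.float32(a * b) + c
    # FMA-like: exact product and sum in float64, one final rounding to float32
    fused = np.float32(np.float64(a) * np.float64(b) + np.float64(c))
    if separate != fused:
        mismatches += 1
print(f'{mismatches} / 10000 triples differ between separate and fused-style rounding')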

I did a little bit of testing and found that the GPU's output in float16 mode can be matched if we promote the datatype to float32 before the computation and demote it afterward. This could be caused by internal intermediate casting or by the better numerical stability of fused operations (torch.backends.cudnn.enabled does not matter). It does not explain the float32 case, though.

import torch

def test(L, M, N):
    # compare CPU and GPU float16 results for a (L x M) @ (M x N) matmul
    for _ in range(5000):
        a = torch.rand(L, M, dtype=torch.float16)
        b = torch.rand(M, N, dtype=torch.float16)

        cpu_result = a@b
        gpu_result = (a.cuda()@b.cuda()).cpu()
        if (cpu_result-gpu_result).any():
            print(f'({L}x{M}) @ ({M}x{N}) failed')
            return
    else:
        print(f'({L}x{M}) @ ({M}x{N}) passed')


test(1, 1, 1)
test(1, 2, 1)
test(4, 1, 4)
test(4, 4, 4)

def test2():
    # (1x2) @ (2x1): compare library results against explicit scalar formulas,
    # one accumulating in float16 and one promoting to float32 first
    for _ in range(5000):
        a = torch.rand(1, 2, dtype=torch.float16)
        b = torch.rand(2, 1, dtype=torch.float16)

        cpu_result = a@b
        gpu_result = (a.cuda()@b.cuda()).cpu()

        half_result = a[0,0]*b[0,0] + a[0,1]*b[1,0]
        convert_result = (a[0,0].float()*b[0,0].float() + a[0,1].float()*b[1,0].float()).half()

        if ((cpu_result-half_result).any()):
            print('CPU != half')
            return
        if (gpu_result-convert_result).any():
            print('GPU != convert')
            return
    else:
        print('All passed')

test2()

Output:

(1x1) @ (1x1) passed
(1x2) @ (2x1) failed
(4x1) @ (1x4) passed
(4x4) @ (4x4) failed
All passed

You can tell that when the inner dimension is 1, the check passes (no multiply-add/reduction is needed).
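
For instance, in the (4x1) @ (1x4) case each output element is a single product, so there is nothing to accumulate; a small sketch (just an illustration, not part of the tests above):

import torch

a = torch.rand(4, 1, dtype=torch.float16)
b = torch.rand(1, 4, dtype=torch.float16)

outer = a @ b        # each output entry reduces over a single product
elementwise = a * b  # the same products via broadcasting, with no reduction at all
print(torch.equal(outer, elementwise))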

Upvotes: 2
