Kevin Sheng

Reputation: 61

Why does pytorch matmul get different results when executed on cpu and gpu?

I am trying to figure out the rounding differences between numpy/pytorch, gpu/cpu, and float16/float32 numbers, and what I'm finding confuses me.

The basic version is:

import torch

a = torch.rand(3, 4, dtype=torch.float32)
b = torch.rand(4, 5, dtype=torch.float32)
print(a.numpy()@b.numpy() - a@b)

The result is all zeros, as expected. However,

print((a.cuda()@b.cuda()).cpu() - a@b)

gets non-zero results. Why is PyTorch float32 matmul executed differently on the GPU and the CPU?

An even more confusing experiment involves float16, as follows:

a = torch.rand(3, 4, dtype=torch.float16)
b = torch.rand(4, 5, dtype=torch.float16)
print(a.numpy()@b.numpy() - a@b)
print((a.cuda()@b.cuda()).cpu() - a@b)

Both of these results are non-zero. Why are float16 numbers handled differently by numpy and torch? I know the CPU can only do float32 operations and numpy converts float16 to float32 before computing, but the torch calculation is also executed on the CPU.

And guess what, print((a.cuda()@b.cuda()).cpu() - a.numpy()@b.numpy()) gives an all-zero result! This is pure fantasy to me...
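
A sketch of how the float32-promotion guess can be checked explicitly (the astype round trip below is only an illustration of the hypothesis, not something numpy is documented to do internally):

import numpy as np
import torch

a = torch.rand(3, 4, dtype=torch.float16)
b = torch.rand(4, 5, dtype=torch.float16)

# float16 matmul as numpy performs it
direct = a.numpy() @ b.numpy()
# explicit promotion to float32, matmul, then demotion back to float16
promoted = (a.numpy().astype(np.float32) @ b.numpy().astype(np.float32)).astype(np.float16)
print((direct - promoted).any())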

The environment is as follows:


On the advice of some of the commenters, I added the following equality tests:

(a.numpy()@b.numpy() - (a@b).numpy()).any()                      # float32, numpy vs torch CPU
((a.cuda()@b.cuda()).cpu() - a@b).numpy().any()                  # float32, GPU vs CPU
(a.numpy()@b.numpy() - (a@b).numpy()).any()                      # float16, numpy vs torch CPU
((a.cuda()@b.cuda()).cpu() - a@b).numpy().any()                  # float16, GPU vs CPU
((a.cuda()@b.cuda()).cpu().numpy() - a.numpy()@b.numpy()).any()  # float16, GPU vs numpy

each placed directly after the corresponding one of the five print statements above. The results are:

False
True
True
True
False

As for the last one, I've tried it several times, and I think I can rule out luck.

Upvotes: 6

Views: 3320

Answers (1)

hkchengrex

Reputation: 4826

The differences are mostly numerical, as mentioned by @talonmies. The CPU and GPU backends and their respective BLAS libraries are implemented differently and use different operations and orders of operation, hence the numerical differences.

One possible cause is sequential accumulation vs. tree reduction (https://discuss.pytorch.org/t/why-different-results-when-multiplying-in-cpu-than-in-gpu/1356/3); e.g. (((a+b)+c)+d) has different numerical properties than ((a+b)+(c+d)).
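
As a standalone illustration (not from the linked thread), the following sketch shows that float32 addition is not associative, so the two groupings above can round differently:

import numpy as np

rng = np.random.default_rng(0)
mismatches = 0
for _ in range(10000):
    a, b, c, d = rng.random(4).astype(np.float32)
    sequential = ((a + b) + c) + d   # left-to-right accumulation
    pairwise = (a + b) + (c + d)     # tree-style reduction
    if sequential != pairwise:
        mismatches += 1
print(f'{mismatches} / 10000 draws gave different float32 sums')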

This question also mentions fused operations (multiply-add), which can cause numerical differences.
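
To illustrate the fused-multiply-add point, here is a rough sketch that emulates the single rounding of an FMA by computing in float64 and rounding once at the end (not exactly a hardware FMA, but close enough to show the effect):

import numpy as np

rng = np.random.default_rng(0)
mismatches = 0
for _ in range(10000):
    a, b, c = rng.random(3).astype(np.float32)
    # separate ops: round after the multiply, then again after the add
    separate = np.float32(a * b) + c
    # FMA-like: exact product and sum in float64, one final rounding to float32
    fused = np.float32(np.float64(a) * np.float64(b) + np.float64(c))
    if separate != fused:
        mismatches += 1
print(f'{mismatches} / 10000 triples differ between separate and fused-style rounding')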

I did a little bit of testing and found that the GPU's output in float16 mode can be matched if we promote the datatype to float32 before the computation and demote it afterward. This could be caused by internal intermediate casting or by the better numerical stability of fused operations (torch.backends.cudnn.enabled does not matter). It does not explain the float32 case, though.

import torch

def test(L, M, N):
    # compare CPU and GPU float16 results for a (L x M) @ (M x N) matmul
    for _ in range(5000):
        a = torch.rand(L, M, dtype=torch.float16)
        b = torch.rand(M, N, dtype=torch.float16)

        cpu_result = a@b
        gpu_result = (a.cuda()@b.cuda()).cpu()
        if (cpu_result-gpu_result).any():
            print(f'({L}x{M}) @ ({M}x{N}) failed')
            return
    else:
        print(f'({L}x{M}) @ ({M}x{N}) passed')


test(1, 1, 1)
test(1, 2, 1)
test(4, 1, 4)
test(4, 4, 4)

def test2():
    # (1x2) @ (2x1): compare library results against explicit scalar formulas,
    # one accumulating in float16 and one promoting to float32 first
    for _ in range(5000):
        a = torch.rand(1, 2, dtype=torch.float16)
        b = torch.rand(2, 1, dtype=torch.float16)

        cpu_result = a@b
        gpu_result = (a.cuda()@b.cuda()).cpu()

        half_result = a[0,0]*b[0,0] + a[0,1]*b[1,0]
        convert_result = (a[0,0].float()*b[0,0].float() + a[0,1].float()*b[1,0].float()).half()

        if ((cpu_result-half_result).any()):
            print('CPU != half')
            return
        if (gpu_result-convert_result).any():
            print('GPU != convert')
            return
    else:
        print('All passed')

test2()

Output:

(1x1) @ (1x1) passed
(1x2) @ (2x1) failed
(4x1) @ (1x4) passed
(4x4) @ (4x4) failed
All passed

You can tell that when the inner dimension is 1, the check passes (no multiply-add/reduction is needed).
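
For instance, in the (4x1) @ (1x4) case each output element is a single product, so there is nothing to accumulate; a small sketch (just an illustration, not part of the tests above):

import torch

a = torch.rand(4, 1, dtype=torch.float16)
b = torch.rand(1, 4, dtype=torch.float16)

outer = a @ b        # each output entry reduces over a single product
elementwise = a * b  # the same products via broadcasting, with no reduction at all
print(torch.equal(outer, elementwise))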

Upvotes: 2
