Reputation: 106
import numpy as np
import torch

t1_h = torch.tensor(np.arange(100000), dtype=torch.float32)
cuda0 = torch.device('cuda:0')
t1_d = torch.tensor(np.arange(100000), dtype=torch.float32, device=cuda0)
%timeit -n 10000 max_h = torch.max(t1_h, 0)
%timeit -n 10000 max_d = torch.max(t1_d, 0)
10000 loops, best of 3: 144 µs per loop
10000 loops, best of 3: 985 µs per loop
As you can see above, the GPU takes much longer than the CPU. But if I don't specify a dimension when calculating the max, then the GPU is faster:
%timeit -n 10000 max_h = torch.max(t1_h)
%timeit -n 10000 max_d = torch.max(t1_d)
10000 loops, best of 3: 111 µs per loop
10000 loops, best of 3: 41.8 µs per loop
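For reference, torch.max with a dim argument returns both the max values and their indices, while the form without a dim returns only the value:

vals, idxs = torch.max(t1_h, 0)   # named tuple of (values, indices)
print(vals, idxs)                 # tensor(99999.) tensor(99999)
print(torch.max(t1_h))            # tensor(99999.) -- value only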
I also tried argmax instead of max, and there it works as expected (the GPU is faster than the CPU):
%timeit -n 10000 cs_h = torch.argmax(t1_h, 0)
%timeit -n 10000 cs_d = torch.argmax(t1_d, 0)
10000 loops, best of 3: 108 µs per loop
10000 loops, best of 3: 18.1 µs per loop
Is there any reason why torch.max is slow on the GPU when a dimension is specified?
Upvotes: 5
Views: 1278
Reputation: 4961
I discovered this myself and opened an issue in PyTorch. It looks like it will be fixed soon (maybe in version 1.5 or 1.6?), but in the meantime the suggested workaround is:
ii = a.argmax(0)                                   # indices of the max along dim 0
maxval = a.gather(0, ii.unsqueeze(0)).squeeze(0)   # gather the corresponding values
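A quick sanity check (a minimal sketch, assuming a 1-D CUDA tensor like the one in the question) that this matches torch.max:

import torch

a = torch.arange(100000, dtype=torch.float32, device='cuda:0')
ii = a.argmax(0)                                   # index of the max along dim 0
maxval = a.gather(0, ii.unsqueeze(0)).squeeze(0)   # the max value itself
vals, idxs = torch.max(a, 0)                       # reference result
assert torch.equal(maxval, vals) and torch.equal(ii, idxs)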
Upvotes: 1