Saran

Reputation: 1844

PyTorch CUDA vs Numpy for arithmetic operations? Fastest?

I performed element-wise multiplication using Torch with GPU support and NumPy using the functions below, and found that NumPy runs faster than Torch, which I suspect shouldn't be the case.

I want to know how to perform general arithmetic operations with Torch using the GPU.

Note: I ran these code snippets in a Google Colab notebook.

Set the default tensor type so that tensors are created on the GPU when one is available:

torch.set_default_tensor_type(torch.cuda.FloatTensor if 
                              torch.cuda.is_available() else 
                              torch.FloatTensor)

Initialize Torch variables

x = torch.Tensor(200, 100)  # uninitialized cuda.FloatTensor, given the default type set above
y = torch.Tensor(200, 100)

Function in question

def mul(d,f):
    g = torch.mul(d,f).cuda()  # I explicitly called cuda() which is not necessary
    return g

When I call the function above as %timeit mul(x,y)

Returns:

The slowest run took 10.22 times longer than the fastest. This could mean that an intermediate result is being cached. 10000 loops, best of 3: 50.1 µs per loop

Now the trial with NumPy,

I used the same values from the Torch variables:

x_ = x.data.cpu().numpy()
y_ = y.data.cpu().numpy()


def mul_(d,f):
    g = d*f
    return g

%timeit mul_(x_,y_)

Returns:

The slowest run took 12.10 times longer than the fastest. This could mean that an intermediate result is being cached. 100000 loops, best of 3: 7.73 µs per loop

I need some help understanding GPU-enabled Torch operations.

Upvotes: 19

Views: 14897

Answers (1)

dennlinger

Reputation: 11488

GPU operations have to additionally move data to/from the GPU

The problem is that your GPU operation always has to copy the input to GPU memory and then retrieve the results from there, which is quite a costly operation.

NumPy, on the other hand, directly processes the data from the CPU/main memory, so there is almost no delay here. Additionally, your matrices are extremely small, so even in the best-case scenario, there should only be a minute difference.
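
To see the transfer and launch overhead separately from the arithmetic itself, here is a minimal timing sketch of my own (not from the question's code; time_mul is a hypothetical helper, and it assumes PyTorch 0.4+ where tensors carry a device). It keeps the inputs resident on one device at a time and calls torch.cuda.synchronize() before reading the clock, since CUDA kernels are launched asynchronously.

import time
import torch

def time_mul(a, b, n_iters=1000):
    # Warm-up so one-time costs (caching, kernel selection) are excluded
    for _ in range(10):
        a * b
    if a.is_cuda:
        torch.cuda.synchronize()  # wait for queued kernels before starting the clock
    start = time.perf_counter()
    for _ in range(n_iters):
        a * b
    if a.is_cuda:
        torch.cuda.synchronize()  # wait for all timed kernels to finish
    return (time.perf_counter() - start) / n_iters

x_cpu = torch.rand(200, 100, device="cpu")
print("CPU:", time_mul(x_cpu, x_cpu))
if torch.cuda.is_available():
    x_gpu = x_cpu.cuda()
    print("GPU:", time_mul(x_gpu, x_gpu))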

This is also partially the reason why you use mini-batches when training on a GPU in neural networks: Instead of having several extremely small operations, you now have "one big bulk" of numbers that you can process in parallel.

Also note that GPU clock speeds are generally way lower than CPU clocks, so the GPU only really shines because it has way more cores. If your matrix does not utilize all of them fully, you are also likely to see a faster result on your CPU.

TL;DR: If your matrix is big enough, you will eventually see a speed-up with CUDA over NumPy, even with the additional cost of the GPU transfer.
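
As a rough illustration of that crossover, the sketch below (my own, with a hypothetical bench helper, again assuming a reasonably recent PyTorch) runs the same element-wise multiply at a few sizes. The exact timings depend on the hardware, but the gap should reverse as the matrices grow.

import time
import torch

def bench(device, size, n_iters=100):
    # Allocate the operands directly on the target device so no transfer is timed
    a = torch.rand(size, size, device=device)
    b = torch.rand(size, size, device=device)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        a * b
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters

for size in (100, 1000, 5000):
    report = "%dx%d: CPU %.1f us" % (size, size, bench("cpu", size) * 1e6)
    if torch.cuda.is_available():
        report += ", GPU %.1f us" % (bench("cuda", size) * 1e6)
    print(report)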

Upvotes: 22
