Reputation: 356
I have a list of tensors, all of which are on the GPU. I obtained this list by splitting one tensor on the GPU using torch.split. I want to get a list of sums, one per tensor: in simple terms, the first element should be the sum of the first tensor in the list, and so on. If I run a for loop for this, does it get parallelised? If not, is there a way to make it run in parallel? I want to parallelise it since the list is pretty long and the sum can be computed independently on every tensor in the list. If this operation can be performed on the GPU, the performance gain would be immense.
UPDATE: Consider that I have a list of tensors as follows:
ls
[tensor([[0.8469, 0.3712, 0.2956],
[0.6548, 0.5284, 0.8682],
[0.5748, 0.2390, 0.1402],
[0.0010, 0.1794, 0.6048],
[0.4636, 0.4101, 0.6543]], device='cuda:0'),
tensor([[0.2138, 0.3613, 0.8712],
[0.4689, 0.0503, 0.7342],
[0.1368, 0.0688, 0.9223]], device='cuda:0'),
tensor([[0.3131, 0.6142, 0.1555],
[0.4099, 0.5000, 0.7578],
[0.7353, 0.2425, 0.4407],
[0.5943, 0.0377, 0.4820],
[0.5898, 0.9585, 0.6993]], device='cuda:0'),
tensor([[0.8629, 0.3172, 0.4248],
[0.9957, 0.6998, 0.0931],
[0.0258, 0.9898, 0.5250]], device='cuda:0'),
tensor([[0.0298, 0.4033, 0.9465],
[0.2763, 0.9412, 0.4873]], device='cuda:0')]
As you can see, I have a list of 5 tensors of different shapes. Each tensor has size 3 in its last dimension; the shapes differ only in the 0th dimension. So, in this example, the shapes of the tensors in the list are [5, 3], [3, 3], [5, 3], [3, 3] and [2, 3]. I want to get a list of tensors from this list as follows:
sums = [torch.sum(li, axis=0) for li in ls]
sums
[tensor([2.5412, 1.7280, 2.5632], device='cuda:0'),
tensor([0.8195, 0.4804, 2.5277], device='cuda:0'),
tensor([2.6424, 2.3528, 2.5352], device='cuda:0'),
tensor([1.8844, 2.0068, 1.0429], device='cuda:0'),
tensor([0.3062, 1.3445, 1.4338], device='cuda:0')]
So, as you can see, the first tensor in sums is the sum of the first tensor in ls along dimension 0, the second tensor is the sum of the second tensor in ls along dimension 0, and so on.
To do this, I'm currently using a for loop which iteratively computes each sum and appends it to the sums list. However, this is very inefficient, as my list of tensors is really big (on the order of 100K entries). I wanted to find out if there is any way to do this more efficiently.
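The loop I am using is essentially of this form (a rough sketch):
sums = []
for t in ls:
    # one small reduction kernel is launched per tensor in the ~100K-long list
    sums.append(torch.sum(t, dim=0))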
The list ls of tensors is obtained by splitting a big tensor like this:
splitter = [5, 3, 5, 3, 2]
A = torch.rand(18, 3).cuda()
ls = torch.split(A, splitter)
ls
(tensor([[0.1969, 0.6113, 0.3563],
[0.9180, 0.7759, 0.5953],
[0.0279, 0.4014, 0.2268],
[0.9026, 0.3821, 0.1498],
[0.3630, 0.9144, 0.3277]], device='cuda:0'),
tensor([[2.1312e-02, 5.2311e-01, 8.9177e-02],
[4.7427e-01, 2.4503e-04, 1.2559e-01],
[5.1641e-01, 9.1357e-01, 9.5637e-01]], device='cuda:0'),
tensor([[0.3730, 0.4251, 0.9437],
[0.5634, 0.3086, 0.5891],
[0.5602, 0.0872, 0.2128],
[0.7717, 0.1920, 0.3977],
[0.5787, 0.3488, 0.7499]], device='cuda:0'),
tensor([[0.9338, 0.4330, 0.8843],
[0.5646, 0.0574, 0.8790],
[0.4692, 0.5831, 0.9160]], device='cuda:0'),
tensor([[0.9786, 0.5209, 0.9364],
[0.4370, 0.4917, 0.3672]], device='cuda:0'))
So, if avoiding the for loop is not possible, does anyone have any ideas on summing the main tensor A according to a provided splitter? For example, in the code above the splitter is [5, 3, 5, 3, 2]. I want to obtain a tensor res from tensor A such that the first row of res is the sum of the first 5 rows of A (because splitter[0] = 5) along dim=0, the second row of res is the sum of the next 3 rows (rows 5 to 7) of A, and so on. Can I do this without using a for loop? Or can I parallelise this for loop, since the operations it performs are independent of each other and are mutually exclusive and exhaustive?
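In other words, res should end up equivalent to something like this (just a sketch of the intended semantics, written with the very loop I would like to avoid):
res = torch.zeros(len(splitter), A.shape[1], device=A.device)
start = 0
for i, n in enumerate(splitter):
    # row i of res is the sum of the next n rows of A
    res[i] = A[start:start + n].sum(dim=0)
    start += n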
I hope the added details are enough. If I need to add any further details to the question, please let me know. Thanks in advance :)
Upvotes: 4
Views: 2110
Reputation: 1091
PyTorch runs GPU operations asynchronously (see docs).
When you call a function that uses the GPU, the operations are enqueued to that particular device. This means your sum operations may run in parallel.
I have made a simple experiment to test this. If I am right, it suggests that you don't need to worry about parallelism here.
import torch
A = torch.rand(100000, 32, device='cuda')
splits = torch.split(A, 4)
Your code:
%%timeit -r1 -n5
sums = [s.sum() for s in splits]
torch.cuda.synchronize()
# Output: 5 loops, best of 1: 374 ms per loop
With synchronization forced between the sum operations (torch.cuda.synchronize() returns None, so the or falls through to s.sum()):
%%timeit -r1 -n5
sums = [torch.cuda.synchronize() or s.sum() for s in splits]
# Output: 5 loops, best of 1: 897 ms per loop
Upvotes: 3
Reputation: 1974
If the splits can all be the same size, then you can solve it in a vectorized way:
splitter = [6, 6, 6]
A = torch.rand(18, 3).cuda()
A_splits = A.reshape(len(splitter), -1, 3)  # (3, 6, 3): one block of 6 rows per split
sums = A_splits.sum(dim=1)  # (3, 3): one row sum per split
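A quick sanity check (just a sketch) that this matches summing each split separately:
# compare against per-split sums computed with torch.split
expected = torch.stack([s.sum(dim=0) for s in torch.split(A, splitter)])
print(torch.allclose(sums, expected))  # should print True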
That is not the general solution you were looking for, but maybe it already solves your problem?
Edit:
Ideally, you would replace the loop with a vectorized operation (such as .sum(dim=1)), but vectorized operations only work on tensor data. If the differences between the tensor sizes are not that big, you can pad all of them with zeros to the same shape.
splitter = [5, 3, 5, 3, 2]  # largest split has 5 rows
A = torch.rand(18, 3).cuda()
A_pad = torch.zeros(max(splitter) * len(splitter), 3, device=A.device)  # padded buffer on the same device as A
splitter_index = torch.tensor([i + (max(splitter) * n) for n, l in enumerate(splitter) for i in range(l)])
A_pad[splitter_index] = A  # scatter the real rows into the padded buffer
A_sum = A_pad.view(-1, max(splitter), 3).sum(dim=1)  # (len(splitter), max(splitter), 3) summed over dim 1
A_sum
tensor([[2.2903, 2.3379, 2.6550],
[1.1394, 1.2519, 0.7374],
[1.7970, 2.8287, 2.4855],
[0.7964, 1.1991, 1.4032],
[1.8656, 0.4916, 0.2935]])
There is a memory/speed trade-off here. Hopefully, that is closer to what you were looking for.
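If the padding wastes too much memory, a rough (untested) sketch that avoids the padded buffer entirely, building one segment id per row and accumulating with index_add_, could look like this:
splitter = [5, 3, 5, 3, 2]
A = torch.rand(18, 3).cuda()
# one segment id per row of A: [0,0,0,0,0, 1,1,1, 2,2,2,2,2, 3,3,3, 4,4]
seg = torch.repeat_interleave(torch.arange(len(splitter), device=A.device),
                              torch.tensor(splitter, device=A.device))
res = torch.zeros(len(splitter), A.shape[1], device=A.device)
res.index_add_(0, seg, A)  # row i of res accumulates every row of A whose segment id is i
The accumulation then happens in a single index_add_ call instead of going through a padded view.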
Upvotes: 0