Reputation: 356
I have a list of tensors, all of which are on the GPU. I obtained this list by splitting one tensor on the GPU using torch.split. I want to get a list of sums, one per tensor: in simple terms, the first element should be the sum of the first tensor in the list, and so on. If I run a for loop for this, does it get parallelised? If not, is there a way to make it run in parallel? I want to parallelise it since the list is pretty long and the sum can be computed independently on every tensor in the list. If this operation can be performed on the GPU, the performance gain would be immense.
UPDATE: Consider that I have a list of tensors as follows:
ls
[tensor([[0.8469, 0.3712, 0.2956],
[0.6548, 0.5284, 0.8682],
[0.5748, 0.2390, 0.1402],
[0.0010, 0.1794, 0.6048],
[0.4636, 0.4101, 0.6543]], device='cuda:0'),
tensor([[0.2138, 0.3613, 0.8712],
[0.4689, 0.0503, 0.7342],
[0.1368, 0.0688, 0.9223]], device='cuda:0'),
tensor([[0.3131, 0.6142, 0.1555],
[0.4099, 0.5000, 0.7578],
[0.7353, 0.2425, 0.4407],
[0.5943, 0.0377, 0.4820],
[0.5898, 0.9585, 0.6993]], device='cuda:0'),
tensor([[0.8629, 0.3172, 0.4248],
[0.9957, 0.6998, 0.0931],
[0.0258, 0.9898, 0.5250]], device='cuda:0'),
tensor([[0.0298, 0.4033, 0.9465],
[0.2763, 0.9412, 0.4873]], device='cuda:0')]
As you can see, I have a list of 5 tensors of different shapes. Each tensor has size 3 in its last dimension; the shapes differ only in the 0th dimension. So, in this example, the shapes of the tensors in the list are [5, 3], [3, 3], [5, 3], [3, 3] and [2, 3]. I want to get a list of tensors from this list as follows:
sums = [torch.sum(li, axis=0) for li in ls]
sums
[tensor([2.5412, 1.7280, 2.5632], device='cuda:0'),
tensor([0.8195, 0.4804, 2.5277], device='cuda:0'),
tensor([2.6424, 2.3528, 2.5352], device='cuda:0'),
tensor([1.8844, 2.0068, 1.0429], device='cuda:0'),
tensor([0.3062, 1.3445, 1.4338], device='cuda:0')]
So, as you can see, the first tensor in sums is the sum of the first tensor in ls along dimension 0, the second tensor is the sum of the second tensor in ls along dimension 0, and so on.
To do this, I'm currently using a for loop which iteratively computes each sum and appends it to the sums list. However, this is very inefficient, as my list of tensors is really big (on the order of 100K entries). I wanted to find out if there is any way to do this more efficiently.
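The loop I am using is essentially of this form (a rough sketch):
sums = []
for t in ls:
    # one small reduction kernel is launched per tensor in the ~100K-long list
    sums.append(torch.sum(t, dim=0))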
The list ls of tensors is obtained by splitting a big tensor like this:
splitter = [5, 3, 5, 3, 2]
A = torch.rand(18, 3).cuda()
ls = torch.split(A, splitter)
ls
(tensor([[0.1969, 0.6113, 0.3563],
[0.9180, 0.7759, 0.5953],
[0.0279, 0.4014, 0.2268],
[0.9026, 0.3821, 0.1498],
[0.3630, 0.9144, 0.3277]], device='cuda:0'),
tensor([[2.1312e-02, 5.2311e-01, 8.9177e-02],
[4.7427e-01, 2.4503e-04, 1.2559e-01],
[5.1641e-01, 9.1357e-01, 9.5637e-01]], device='cuda:0'),
tensor([[0.3730, 0.4251, 0.9437],
[0.5634, 0.3086, 0.5891],
[0.5602, 0.0872, 0.2128],
[0.7717, 0.1920, 0.3977],
[0.5787, 0.3488, 0.7499]], device='cuda:0'),
tensor([[0.9338, 0.4330, 0.8843],
[0.5646, 0.0574, 0.8790],
[0.4692, 0.5831, 0.9160]], device='cuda:0'),
tensor([[0.9786, 0.5209, 0.9364],
[0.4370, 0.4917, 0.3672]], device='cuda:0'))
So, if avoiding the for loop is not possible, does anyone have any ideas on summing the main tensor A according to a provided splitter? For example, in the code above the splitter is [5, 3, 5, 3, 2]. I want to obtain a tensor res from tensor A such that the first row of res is the sum of the first 5 rows of A (because splitter[0] = 5) along dim=0, the second row of res is the sum of the next 3 rows (rows 5 to 7) of A, and so on. Can I do this without using a for loop? Or can I parallelise this for loop, since the operations it performs are independent of each other and are mutually exclusive and exhaustive?
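In other words, res should end up equivalent to something like this (just a sketch of the intended semantics, written with the very loop I would like to avoid):
res = torch.zeros(len(splitter), A.shape[1], device=A.device)
start = 0
for i, n in enumerate(splitter):
    # row i of res is the sum of the next n rows of A
    res[i] = A[start:start + n].sum(dim=0)
    start += n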
I hope the added details are enough. If I need to add any further details to the question, please let me know. Thanks in advance :)
Upvotes: 4
Views: 2110
Reputation: 1091
PyTorch runs GPU operations asynchronously (see docs).
When you call a function that uses the GPU, the operations are enqueued to that particular device. This means your sum operations may run in parallel.
I have made a simple experiment to test this. If I am right, it suggests that you don't need to worry about parallelism here.
import torch
A = torch.rand(100000, 32, device='cuda')
splits = torch.split(A, 4)
Your code:
%%timeit -r1 -n5
sums = [s.sum() for s in splits]
torch.cuda.synchronize()
# Output: 5 loops, best of 1: 374 ms per loop
With synchronization forced between the sum operations (torch.cuda.synchronize() returns None, so the or falls through to s.sum()):
%%timeit -r1 -n5
sums = [torch.cuda.synchronize() or s.sum() for s in splits]
# Output: 5 loops, best of 1: 897 ms per loop
Upvotes: 3
Reputation: 1974
If the splits can all be the same size, then you can solve it in a vectorized way:
splitter = [6, 6, 6]
A = torch.rand(18, 3).cuda()
A_splits = A.reshape(len(splitter), -1, 3)  # (3, 6, 3): one block of 6 rows per split
sums = A_splits.sum(dim=1)  # (3, 3): one row sum per split
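A quick sanity check (just a sketch) that this matches summing each split separately:
# compare against per-split sums computed with torch.split
expected = torch.stack([s.sum(dim=0) for s in torch.split(A, splitter)])
print(torch.allclose(sums, expected))  # should print True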
That is not the general solution you were looking for, but maybe it already solves your problem?
Edit:
Ideally, you would replace the loop with a vectorized operation (such as .sum(dim=1)), but vectorized operations only work on tensor data. If the differences between the tensor sizes are not that big, you can pad all of them with zeros to the same shape.
splitter = [5, 3, 5, 3, 2]  # largest split has 5 rows
A = torch.rand(18, 3).cuda()
A_pad = torch.zeros(max(splitter) * len(splitter), 3, device=A.device)  # padded buffer on the same device as A
splitter_index = torch.tensor([i + (max(splitter) * n) for n, l in enumerate(splitter) for i in range(l)])
A_pad[splitter_index] = A  # scatter the real rows into the padded buffer
A_sum = A_pad.view(-1, max(splitter), 3).sum(dim=1)  # (len(splitter), max(splitter), 3) summed over dim 1
A_sum
tensor([[2.2903, 2.3379, 2.6550],
[1.1394, 1.2519, 0.7374],
[1.7970, 2.8287, 2.4855],
[0.7964, 1.1991, 1.4032],
[1.8656, 0.4916, 0.2935]])
There is a memory/speed trade-off here. Hopefully, that is closer to what you were looking for.
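If the padding wastes too much memory, a rough (untested) sketch that avoids the padded buffer entirely, building one segment id per row and accumulating with index_add_, could look like this:
splitter = [5, 3, 5, 3, 2]
A = torch.rand(18, 3).cuda()
# one segment id per row of A: [0,0,0,0,0, 1,1,1, 2,2,2,2,2, 3,3,3, 4,4]
seg = torch.repeat_interleave(torch.arange(len(splitter), device=A.device),
                              torch.tensor(splitter, device=A.device))
res = torch.zeros(len(splitter), A.shape[1], device=A.device)
res.index_add_(0, seg, A)  # row i of res accumulates every row of A whose segment id is i
The accumulation then happens in a single index_add_ call instead of going through a padded view.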
Upvotes: 0