Reputation: 317
As far as I understand, the strength of PyTorch is supposed to be that it works with dynamic computational graphs. In the context of NLP, that means that sequences with variable lengths do not necessarily need to be padded to the same length. But, if I want to use PyTorch DataLoader, I need to pad my sequences anyway because the DataLoader only takes tensors - given that I, as a total beginner, do not want to build some customized collate_fn.
Now this makes me wonder - doesn’t this wash away the whole advantage of dynamic computational graphs in this context? Also, if I pad my sequences to feed it into the DataLoader as a tensor with many zeros as padding tokens at the end (in the case of word ids), will it have any negative effect on my training since PyTorch may not be optimized for computations with padded sequences (since the whole premise is that it can work with variable sequence lengths in the dynamic graphs), or does it simply not make any difference?
I will also post this question in the PyTorch Forum...
Thanks!
Upvotes: 4
Views: 731
Reputation: 57619
In the context of NLP, that means that sequences with variable lengths do not necessarily need to be padded to the same length.
This means that you don't need to pad sequences unless you are doing data batching, which is currently the only way to add parallelism in PyTorch. DyNet has a method called autobatching (described in detail in this paper) that batches the graph operations instead of the data, so this might be what you want to look into.
But, if I want to use PyTorch DataLoader, I need to pad my sequences anyway because the DataLoader only takes tensors - given that I, as a total beginner, do not want to build some customized collate_fn.
You can use the DataLoader given you write your own Dataset class and you are using batch_size=1. The twist is to use numpy arrays for your variable length sequences (otherwise default_collate will give you a hard time):
import numpy as np
from torch.utils.data import Dataset, DataLoader


class FooDataset(Dataset):
    def __init__(self, data, target):
        assert len(data) == len(target)
        self.data = data
        self.target = target

    def __getitem__(self, index):
        return self.data[index], self.target[index]

    def __len__(self):
        return len(self.data)


# Variable-length sequences stored as numpy arrays
data = [[1, 2, 3], [4, 5, 6, 7, 8]]
data = [np.array(n) for n in data]
targets = ['a', 'b']

ds = FooDataset(data, targets)
dl = DataLoader(ds, batch_size=1)

print(list(enumerate(dl)))
# [(0, [
# 1 2 3
# [torch.LongTensor of size 1x3]
# , ('a',)]), (1, [
# 4 5 6 7 8
# [torch.LongTensor of size 1x5]
# , ('b',)])]
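If you do want batches larger than 1, a custom collate_fn is less work than it sounds. The following is only a minimal sketch, not part of the original answer: SeqDataset and pad_collate are hypothetical names, the integer labels are made up for illustration, and it assumes word id 0 is reserved for padding. It pads each batch only up to the longest sequence in that batch, using torch.nn.utils.rnn.pad_sequence, and also returns the true lengths so they can be used later.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class SeqDataset(Dataset):
    """Hypothetical dataset of variable-length word-id sequences."""

    def __init__(self, sequences, labels):
        assert len(sequences) == len(labels)
        self.sequences = sequences
        self.labels = labels

    def __getitem__(self, index):
        return torch.tensor(self.sequences[index]), self.labels[index]

    def __len__(self):
        return len(self.sequences)


def pad_collate(batch):
    """Pad a batch to the length of its longest sequence."""
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in sequences])
    # Assumption: id 0 is the padding token.
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=0)
    return padded, lengths, torch.tensor(labels)


sequences = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]
labels = [0, 1, 0]  # made-up integer labels
loader = DataLoader(SeqDataset(sequences, labels), batch_size=2,
                    collate_fn=pad_collate)

batches = list(loader)
for padded, lengths, batch_labels in batches:
    print(tuple(padded.shape), lengths.tolist(), batch_labels.tolist())
```

With this approach the amount of padding depends on each batch rather than on the longest sequence in the whole dataset, and the returned lengths are what pack_padded_sequence expects.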
Now this makes me wonder - doesn’t this wash away the whole advantage of dynamic computational graphs in this context?
Fair point, but the main strength of dynamic computational graphs is (at least currently) the ability to use debugging tools like pdb, which can greatly reduce your development time. Debugging is way harder with static computation graphs. There is also no reason why PyTorch would not implement further just-in-time optimizations or a concept similar to DyNet's auto-batching in the future.
Also, if I pad my sequences to feed it into the DataLoader as a tensor with many zeros as padding tokens at the end [...], will it have any negative effect on my training [...]?
Yes, both in runtime and for the gradients. The RNN will iterate over the padding just like normal data, which means that you have to deal with it in some way. PyTorch supplies you with tools for dealing with padded sequences and RNNs, namely pad_packed_sequence and pack_padded_sequence. These will let you ignore the padded elements during RNN execution, but beware: this does not work with RNNs that you implement yourself (or at least not if you don't add support for it manually).
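To illustrate how the two functions fit together, here is a rough sketch (not from the original answer). The vocabulary size, layer sizes and the choice of id 0 as padding token are assumptions made up for the example. It packs a padded batch, runs it through an nn.LSTM, unpacks the output, and compares the final hidden state of the short sequence with and without packing.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Two sequences of word ids, padded with 0 to the same length.
padded = torch.tensor([[4, 5, 6, 7, 8],
                       [1, 2, 3, 0, 0]])
lengths = torch.tensor([5, 3])  # true lengths, kept on the CPU

# Assumed toy sizes: vocabulary of 10 ids, 4-dim embeddings, 6 hidden units.
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=0)
lstm = nn.LSTM(input_size=4, hidden_size=6, batch_first=True)

embedded = embedding(padded)  # shape: (batch, max_len, 4)

# Pack so the LSTM skips the padded time steps, then unpack the output.
packed = pack_padded_sequence(embedded, lengths, batch_first=True,
                              enforce_sorted=False)
packed_out, (h_n, c_n) = lstm(packed)
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

# Reference: the short sequence on its own, without any padding.
_, (h_alone, _) = lstm(embedded[1:2, :3])
# Without packing, the LSTM also steps over the two padding positions.
_, (h_unpacked, _) = lstm(embedded)

print(torch.allclose(h_n[0, 1], h_alone[0, 0], atol=1e-6))
print(torch.allclose(h_n[0, 1], h_unpacked[0, 1], atol=1e-6))
```

With packing, the final hidden state of the short sequence should match what you get from running it alone, while the unpacked run keeps updating the state on the padding steps. The output positions past each sequence's true length are filled with zeros by pad_packed_sequence.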
Upvotes: 4