Reputation: 397
I am attempting to train an RNN on time series data, and while there are plenty of tutorials on how to build an RNN model, I am having trouble building the DataLoader object for this task. The sequences are all the same length, so there is no need for padding. The approach I have taken so far is to return a window of data from the __getitem__ method of the Dataset class and define the length as len(data) - seq_len + 1. However, this feels a bit "hacky" and I suspect there is a more proper way to do it. This method seems confusing, and I worry it would cause problems when collaborating with a group. More specifically, I think that overriding the sampler passed to the PyTorch DataLoader is the correct way, but I am having trouble understanding how to implement that. Below is the current Dataset class I have built; can anyone point me in the right direction? Thank you in advance.
import numpy as np
import torch
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, df, cats, y, seq_l):
        self.n, self.seq_l = len(df), seq_l
        # categorical columns, stacked column-wise as int64
        self.cats = np.stack([c.values for n, c in df[cats].items()], 1).astype(np.int64)
        # remaining (continuous) columns as float32
        cont_cols = [i for i in df.columns if i not in cats]
        self.conts = np.stack([c.values for n, c in df[cont_cols].items()], 1).astype(np.float32)
        self.y = np.array(y)

    def __len__(self):
        # number of full windows of length seq_l
        return len(self.y) - self.seq_l + 1

    def __getitem__(self, idx):
        # a window of seq_l consecutive rows, labeled with the last row's target
        return [
            (torch.from_numpy(self.cats[idx:idx + self.seq_l]),
             torch.from_numpy(self.conts[idx:idx + self.seq_l])),
            self.y[idx + self.seq_l - 1],
        ]
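For reference, a minimal sketch of how I currently feed this into a DataLoader (the batch size and argument names here are placeholders):

from torch.utils.data import DataLoader

ds = CustomDataset(df, cat_cols, targets, seq_len)  # placeholder arguments
dl = DataLoader(ds, batch_size=64, shuffle=False)   # shuffle=False keeps windows in time order
(x_cat, x_cont), y = next(iter(dl))
# x_cat: (64, seq_len, n_cat_cols), x_cont: (64, seq_len, n_cont_cols)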
Upvotes: 3
Views: 4820
Reputation: 1434
If I understand correctly, you have time series data and you want to create batches of equal-length sequences by sampling from it?

I think you can use Dataset to return just one sample of data, as it was originally intended by the PyTorch developers, and stack samples into a batch with your own collate_fn function passed to the DataLoader class (collate_fn is a callable that takes a list of samples and returns a batch; padding, for example, is usually done there). That way your Dataset class would not depend on the sequence length (= batch size).

Since I assume you want to preserve the sequential order of your samples when forming a batch (given that you work with time series), you can write your own Sampler class (or use the SequentialSampler already available in PyTorch).

As a result, you decouple your sample representation (Dataset), batch formation (collate_fn in the DataLoader), and sampling (the Sampler class). Hope this helps.
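To illustrate, a minimal sketch of that decoupling. The names TimestepDataset and seq_collate are my own, and I assume cats, conts, and y are tensors prepared elsewhere; here the DataLoader's batch_size plays the role of seq_len, so each "batch" is one window of consecutive timesteps:

import torch
from torch.utils.data import Dataset, DataLoader, SequentialSampler

class TimestepDataset(Dataset):
    # returns exactly one timestep per index, as described above
    def __init__(self, cats, conts, y):
        self.cats, self.conts, self.y = cats, conts, y
    def __len__(self):
        return len(self.y)
    def __getitem__(self, idx):
        return self.cats[idx], self.conts[idx], self.y[idx]

def seq_collate(samples):
    # stack a list of consecutive timesteps into one (seq_len, ...) sequence;
    # the target is the label of the last timestep in the window
    cats, conts, ys = zip(*samples)
    return (torch.stack(cats), torch.stack(conts)), ys[-1]

dataset = TimestepDataset(cats, conts, y)  # assumed tensors, prepared elsewhere
loader = DataLoader(
    dataset,
    batch_size=seq_len,                     # one batch = one window of seq_len steps
    sampler=SequentialSampler(dataset),     # preserve time order
    collate_fn=seq_collate,
    drop_last=True,                         # discard an incomplete final window
)

Note that with batch_size=seq_len this yields non-overlapping windows; if you need the overlapping sliding windows of your original approach, you would instead pass a custom BatchSampler to the DataLoader that yields each index range range(i, i + seq_len).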
Upvotes: 3