Reputation: 335
The problem arises when I want to pass data from different groups into an RNN. Most examples assume a single clean time series, but once there are groups we can't simply slide a window over the dataframe: the window has to jump when the group changes, so that every sample comes from within a single group.
The groups are just different people, so I want to keep each person's sequence separate, e.g. users browsing a website while we collect pageview data. They could equally be different stocks and their associated price movements.
import pandas as pd

data = {
    'group_id': [1, 1, 1, 1, 2, 2],
    'timestep': [1, 2, 3, 4, 1, 2],
    'x': [6, 5, 4, 3, 2, 1],
    'y': [0, 1, 1, 1, 0, 1]
}
df = pd.DataFrame(data=data)
   group_id  timestep  x  y
0         1         1  6  0
1         1         2  5  1
2         1         3  4  1
3         1         4  3  1
4         2         1  2  0
5         2         2  1  1
Let's assume we want a batch size of 2 samples, each with 3 timesteps. RNNSequence.__len__ (below) would then report 3 batches (ceil(6 / 2)), but that is not possible: we can get at most 2 samples from the 1st group (which makes 1 batch), and the 2nd group has only 2 timesteps, so it cannot supply even one 3-timestep sample.
import numpy as np
from keras.utils import Sequence

class RNNSequence(Sequence):
    def __init__(self, x_set, y_set, batch_size, seq_length):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size
        self.seq_length = seq_length

    def __len__(self):
        # Naive batch count: ignores group boundaries and seq_length.
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        # get_batch to be coded
        return get_batch(idx, self.x, self.y, self.batch_size, self.seq_length)
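To make the batch count concrete, here is a minimal sketch in plain Python that counts how many full windows each group can supply. It assumes a group of n rows yields max(n - seq_length + 1, 0) samples; the names windows_per_group, num_samples and num_batches are only illustrative.

```python
import math
from collections import Counter

group_ids = [1, 1, 1, 1, 2, 2]  # the group_id column from the example
batch_size = 2
seq_length = 3

# Rows per group, then the number of full windows each group can supply.
rows_per_group = Counter(group_ids)
windows_per_group = {
    g: max(n - seq_length + 1, 0) for g, n in rows_per_group.items()
}

num_samples = sum(windows_per_group.values())
num_batches = math.ceil(num_samples / batch_size)

print(windows_per_group)         # {1: 2, 2: 0}
print(num_samples, num_batches)  # 2 1
```

So with these settings there is a single batch of 2 samples, not the 3 batches that a count based on len(self.x) alone suggests.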
What is the most efficient way of getting these batches using a sequence?
My solution was to not use Sequence at all and instead write a custom generator that yields batches without knowing in advance how many there will be, then call fit_generator(custom_generator, max_queue_size=batch_size). Is this the most efficient way? One drawback is that the generator does no shuffling; could that be a problem?
Desired output for batch_size=2, seq_length=3 is:
X = [
[ [6], [5], [4] ],
[ [5], [4], [3] ]
]
Y = [ 1, 1 ]
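One possible way to get this output from a generator, and to add shuffling to it, is to compute all valid window start positions first and only then cut them into batches. The sketch below is an assumption-laden illustration, not a definitive implementation: iter_batches is a hypothetical helper, rows are assumed to be sorted by group and then by timestep, and the label of a window is assumed to be the y value at its last timestep (which matches the Y shown above).

```python
import random


def iter_batches(group_ids, xs, ys, batch_size, seq_length,
                 shuffle=False, seed=None):
    """Yield (X, Y) batches of windows that never cross a group boundary.

    Assumes rows are sorted by group and then by timestep, and that the
    label of a window is the y value at its last timestep.
    """
    # Start positions of every window whose first and last row share a group.
    starts = [
        i for i in range(len(xs) - seq_length + 1)
        if group_ids[i] == group_ids[i + seq_length - 1]
    ]
    if shuffle:
        # Shuffle whole windows, never the rows inside a window.
        random.Random(seed).shuffle(starts)
    for b in range(0, len(starts), batch_size):
        chunk = starts[b:b + batch_size]
        X = [[[xs[j]] for j in range(i, i + seq_length)] for i in chunk]
        Y = [ys[i + seq_length - 1] for i in chunk]
        yield X, Y


group_ids = [1, 1, 1, 1, 2, 2]
xs = [6, 5, 4, 3, 2, 1]
ys = [0, 1, 1, 1, 0, 1]

batches = list(iter_batches(group_ids, xs, ys, batch_size=2, seq_length=3))
X, Y = batches[0]
print(X)  # [[[6], [5], [4]], [[5], [4], [3]]]
print(Y)  # [1, 1]
```

This yields one pass over the data. A generator used for Keras training is expected to loop indefinitely, so in practice the body would likely be wrapped in a `while True:` loop and the lists converted to NumPy arrays before yielding.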
Upvotes: 3
Views: 1040
Reputation: 535
It seems you need to not only know the number of batches but also be able to output any batch given just the batch number. You could create an index of all samples in RNNSequence.__init__ or earlier, then assemble batches from this. In __getitem__ you can then output the batches accordingly.
This quick and dirty sketch should illustrate the concept of the sample index. If needed, you could use pandas or numpy functions instead.
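# Sketch for generating the indexes where samples start.
seq_len = 3
batch_size = 2
# Timesteps of each group (numbered from 1), taken from the question's data.
groups = [[1, 2, 3, 4], [1, 2]]
sample_start_ids = []
for group_id, group_timesteps in enumerate(groups):
    for timestep_id, timestep in enumerate(group_timesteps):
        # Only add as sample if it is the first timestep in the group
        # (a group shorter than seq_len then still yields one, shorter,
        # sample) or if a full sample fits. Drop the `timestep == 1` test
        # to skip short groups entirely, as in the question's desired output.
        if timestep == 1 or timestep <= len(group_timesteps) - seq_len + 1:
            sample_start_ids.append((group_id, timestep_id))
num_samples = len(sample_start_ids)
# Group the samples into batches of appropriate size.
your_batches = [sample_start_ids[i:i + batch_size]
                for i in range(0, num_samples, batch_size)]
num_batches = len(your_batches)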
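The index idea can be wired into a Sequence roughly as follows. This is a sketch under stated assumptions rather than a tested training setup: GroupedRNNSequence and its parameters are hypothetical names, rows are assumed to be sorted by group and then by timestep, only full windows are indexed (short groups are skipped, unlike the first-timestep rule above), and the label is taken from the last timestep of each window. The try/except fallback is there only so the sketch runs where Keras is not installed.

```python
import numpy as np

try:
    from keras.utils import Sequence  # as imported in the question
except ImportError:  # fallback so the sketch still runs without Keras
    Sequence = object


class GroupedRNNSequence(Sequence):
    """Hypothetical Sequence serving windows that stay inside one group."""

    def __init__(self, group_ids, x, y, batch_size, seq_length,
                 shuffle=True, seed=0):
        super().__init__()
        self.x = np.asarray(x, dtype="float32")
        if self.x.ndim == 1:
            self.x = self.x[:, None]  # treat a flat column as one feature
        self.y = np.asarray(y)
        self.batch_size = batch_size
        self.seq_length = seq_length
        self.shuffle = shuffle
        self.rng = np.random.default_rng(seed)
        g = np.asarray(group_ids)
        last = np.arange(seq_length - 1, len(g))  # last row of each window
        first = last - (seq_length - 1)           # first row of each window
        # Keep only windows whose first and last row share a group
        # (valid because rows are assumed sorted by group, then timestep).
        self.starts = first[g[first] == g[last]]

    def __len__(self):
        # Number of batches, counted over valid windows rather than rows.
        return int(np.ceil(len(self.starts) / self.batch_size))

    def __getitem__(self, idx):
        starts = self.starts[idx * self.batch_size:(idx + 1) * self.batch_size]
        rows = starts[:, None] + np.arange(self.seq_length)  # (batch, seq)
        X = self.x[rows]           # (batch, seq_length, n_features)
        Y = self.y[rows[:, -1]]    # label at the last timestep of each window
        return X, Y

    def on_epoch_end(self):
        # Reorder whole windows between epochs; rows inside a window stay put.
        if self.shuffle:
            self.rng.shuffle(self.starts)


seq = GroupedRNNSequence(
    group_ids=[1, 1, 1, 1, 2, 2],
    x=[6, 5, 4, 3, 2, 1],
    y=[0, 1, 1, 1, 0, 1],
    batch_size=2, seq_length=3, shuffle=False,
)
X, Y = seq[0]
print(len(seq))    # 1
print(X.tolist())  # [[[6.0], [5.0], [4.0]], [[5.0], [4.0], [3.0]]]
print(Y.tolist())  # [1, 1]
```

Because the window index is built once in __init__, any batch can be produced from its batch number alone, and shuffling only permutes that index.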
Upvotes: 2