user10430178

Reputation: 335

Generating Sequence/Batches with specific lengths for RNN for grouped data

The problem arises when I want to pass data from different groups into an RNN. Most examples assume one clean time series, but once groups are involved we can't simply slide a window over the dataframe: the window has to jump when the group changes, so that every sequence comes from within a single group.

These groups are just different people, so I want to keep each person's sequences separate. For example, users browsing a website while we collect pageview data, or different stocks and their associated price movements.

import pandas as pd
data = {
    'group_id': [1,1,1,1,2,2],
    'timestep': [1,2,3,4,1,2],
    'x': [6,5,4,3,2,1],
    'y': [0,1,1,1,0,1]
}
df = pd.DataFrame(data=data)


   group_id  timestep  x  y
0         1         1  6  0
1         1         2  5  1
2         1         3  4  1
3         1         4  3  1
4         2         1  2  0
5         2         2  1  1

Let's assume we want a batch size of 2 samples, each with 3 timesteps. RNNSequence.__len__ (below) returns 3 batches, but that is not possible: we can get at most 2 samples from the 1st group (which makes 1 batch), and the 2nd group has only 2 timesteps, so no full sample can be drawn from it.
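
To make that count concrete: a group with n rows can supply max(0, n - seq_length + 1) full windows. The snippet below is only a rough sketch in plain Python; it re-declares the group_id column from the dataframe above, and the names rows_per_group and windows_per_group are illustrative, not part of any library.

```python
import math
from collections import Counter

group_id = [1, 1, 1, 1, 2, 2]  # same values as the group_id column above
batch_size, seq_length = 2, 3

# Rows per group, then full windows per group: max(0, n - seq_length + 1).
rows_per_group = Counter(group_id)
windows_per_group = {
    g: max(0, n - seq_length + 1) for g, n in rows_per_group.items()
}

num_samples = sum(windows_per_group.values())
num_batches = math.ceil(num_samples / batch_size)
print(windows_per_group, num_samples, num_batches)
```

For this data the first group yields 2 windows and the second none, so only 1 batch exists rather than the 3 that __len__ reports.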

import numpy as np
from keras.utils import Sequence

class RNNSequence(Sequence):

    def __init__(self, x_set, y_set, batch_size, seq_length):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size
        self.seq_length = seq_length

    def __len__(self):
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        # get_batch to be coded
        return get_batch(idx, self.x, self.y, self.batch_size, self.seq_length)

What is the most efficient way of getting these batches using a sequence?

My solution was to not use Sequence at all and instead use a custom generator that yields data without knowing in advance how many batches there will be, passed to fit_generator(custom_generator, max_queue_size=batch_size). Is this the most efficient way? One drawback is that there is no shuffling; could that be a problem?

Desired output for batch_size=2, seq_length=3 is:

X = [ 
        [ [6], [5], [4] ], 
        [ [5], [4], [3] ] 
    ]

Y = [ 1, 1 ]

Upvotes: 3

Views: 1040

Answers (1)

wjakobw

Reputation: 535

It seems you need not only to know the number of batches but also to be able to output any batch given just its batch number. You could create an index of all samples in RNNSequence.__init__ (or earlier) and then assemble batches from it. In __getitem__ you can then output the batches accordingly.

This quick and dirty sketch should illustrate the concept of the sample index. If needed, you could use pandas or numpy functions instead.

# Sketch of generating indexes for where samples start.
# Assumes `groups` holds each group's 1-based timesteps in order
# (here taken from the question's example data).
seq_len = 3
batch_size = 2
groups = [[1, 2, 3, 4], [1, 2]]

sample_start_ids = []
for group_id, group_timesteps in enumerate(groups):
    for timestep_id, timestep in enumerate(group_timesteps):
        # Only add as sample if it is the first
        # timestep in the group or if a full sample fits.
        if timestep == 1 or timestep <= len(group_timesteps) - seq_len + 1:
            sample_start_ids.append((group_id, timestep_id))

num_samples = len(sample_start_ids)

# Group the samples into batches of appropriate size.
your_batches = [sample_start_ids[i:i + batch_size]
                for i in range(0, num_samples, batch_size)]

num_batches = len(your_batches)
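
One possible way to go from a sample index to actual batches is sketched below with NumPy. It rests on several assumptions that are not stated in the question: rows are sorted by group and then by timestep, only full windows are kept (so, unlike the sketch above, a group shorter than seq_length contributes nothing), and the label of a window is y at its last timestep, which happens to match Y = [1, 1] in the desired output. build_window_starts and get_batch are hypothetical helper names.

```python
import numpy as np

def build_window_starts(group_ids, seq_length):
    """Return row positions where a full window fits inside one group.

    Assumes rows are sorted by group and then by timestep.
    """
    group_ids = np.asarray(group_ids)
    starts = []
    for start in range(len(group_ids) - seq_length + 1):
        # With sorted rows, a window stays inside one group exactly
        # when its first and last rows share the same group id.
        if group_ids[start] == group_ids[start + seq_length - 1]:
            starts.append(start)
    return np.array(starts, dtype=int)

def get_batch(idx, x, y, starts, batch_size, seq_length):
    """Assemble batch number idx from the precomputed window starts."""
    batch_starts = starts[idx * batch_size:(idx + 1) * batch_size]
    # Row positions of every window in the batch: shape (batch, seq_length).
    rows = batch_starts[:, None] + np.arange(seq_length)
    X = x[rows][..., None]  # add a trailing feature axis
    Y = y[rows[:, -1]]      # label at the last timestep (assumption)
    return X, Y

x = np.array([6, 5, 4, 3, 2, 1])
y = np.array([0, 1, 1, 1, 0, 1])
group_ids = [1, 1, 1, 1, 2, 2]

starts = build_window_starts(group_ids, seq_length=3)
num_batches = int(np.ceil(len(starts) / 2))
X, Y = get_batch(0, x, y, starts, batch_size=2, seq_length=3)
```

In RNNSequence, __len__ would then return num_batches and __getitem__ would call get_batch. Since batches are looked up through starts, shuffling could be done by permuting that array between epochs (for example in on_epoch_end), which keeps every window inside its group.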

Upvotes: 2
