R Nanthak

Reputation: 363

Dataset doesn't fit in memory for LSTM training

I am trying to create a model trained on a large music dataset. The midi files are converted into numpy arrays. Since an LSTM requires sequential data, the dataset becomes huge once it is converted into sequences for the LSTM.

I convert the midi notes into indices based on the key and duration, so I get 6 classes for the C4 key. Likewise I cover C3 to B5, so in total 288 classes, along with classes for rest periods.

The converted format of a single midi looks like this.

midi = [0,23,54,180,23,45,34,.....];

For training the LSTM, the x and y becomes

x = [[0,23,54..45],[23,54,..,34],...];

y=[[34],[76],...]

The values in x and y are further transformed into one-hot encodings. Because of this, the data becomes huge for just 60 small midi files, but I have 1700 files. How can I train the model with this many files? I checked ImageDataGenerator, but it requires the data to be in separate class directories. How can I achieve this?
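For reference, a rough sketch of the window construction and one-hot step I described (the sequence length and class count here are just illustrative values, not my actual code):

import numpy as np

def make_windows(midi, seq_len=32, num_classes=288):
    # Sliding window over one converted midi: seq_len inputs, next index as target.
    x, y = [], []
    for i in range(len(midi) - seq_len):
        x.append(midi[i:i + seq_len])
        y.append(midi[i + seq_len])
    x, y = np.array(x), np.array(y)
    # One-hot encoding multiplies memory use by num_classes:
    # (n_windows, seq_len) -> (n_windows, seq_len, num_classes)
    x_hot = np.eye(num_classes)[x]
    y_hot = np.eye(num_classes)[y]
    return x_hot, y_hot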

Upvotes: 4

Views: 2058

Answers (2)

R Nanthak

Reputation: 363

I used a generator class for this problem with the following code. The generator is modified for my purpose, and memory usage is dramatically reduced.

import numpy as np
from tensorflow.keras.utils import Sequence


class Generator(Sequence):

    def __init__(self, x_set, y_set, batch_size=4):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size
        self.indices = np.arange(len(self.x))

    def __len__(self):
        # Number of batches per epoch.
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        # Indices of the samples that make up this batch.
        inds = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_x = []
        batch_y = []
        for ind in inds:
            # One-hot encode the input sequence (323 classes in my case).
            ip = []
            for q in self.x[ind]:
                o = np.zeros(323)
                o[int(q)] = 1
                ip.append(o)
            batch_x.append(ip)
            # One-hot encode the target sequence.
            hot_encoded = []
            for val in self.y[ind]:
                t = np.zeros(323)
                t[int(val)] = 1
                hot_encoded.append(t)
            batch_y.append(hot_encoded)

        return np.array(batch_x), np.array(batch_y)

    def on_epoch_end(self):
        # Shuffle the sample order between epochs.
        np.random.shuffle(self.indices)
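For completeness, a minimal usage sketch, assuming x_train and y_train hold the index sequences and model is an already-compiled Keras model (these names are placeholders):

# x_train/y_train and model are assumed to exist already.
train_gen = Generator(x_train, y_train, batch_size=4)
model.fit(train_gen, epochs=50)  # on older Keras, use model.fit_generator(train_gen, ...)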

Upvotes: 1

Théo Rubenach

Reputation: 574

You should generate your training data on the fly, during the training itself. Based on the tf documentation, you can write your own generator to use as training data, or inherit from Sequence.

The first option should look like

def create_data_generator(your_files):
    # Load every file once; the windows below are just slices into this data.
    raw_midi_data = process_files(your_files)
    seq_size = 32

    def _my_generator():
        i = 0
        while True:
            # Sliding window: seq_size inputs, the following element as target.
            x = raw_midi_data[i:i + seq_size]
            y = raw_midi_data[i + seq_size]
            i = (i + 1) % (len(raw_midi_data) - seq_size)
            yield x, y

    return _my_generator()

And then call it with (assuming tf >= 2.0)

generator = create_data_generator(your_files)
model.fit(x=generator, ...)
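Since the generator loops forever, you also need to tell fit how many steps make up one epoch, for example (the numbers here are placeholders):

# steps_per_epoch is required with an endless generator; the values are placeholders.
model.fit(x=generator, steps_per_epoch=1000, epochs=20)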

If you are using "old" Keras (from before tensorflow 2.0), which the Keras team itself no longer recommends, you should use fit_generator instead:

model.fit_generator(generator, ...)

With this solution, you only store your data in memory once; there is no duplication due to overlapping sequences.

Upvotes: 6
