mEdi Han

Reputation: 31

Preprocess huge data with a custom data generator function for keras

Actually I'm building a Keras model and I have a dataset in the msg format with over 10 million instances and 40 features, which are all categorical. For the moment I'm using just a sample of it, since reading the whole dataset and encoding it is impossible to fit into memory. Here's part of the code I'm using:

import pandas as pd
from category_encoders import BinaryEncoder as be
from sklearn.preprocessing import StandardScaler
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD


def model():
   model = Sequential()
   model.add(Dense(120, input_dim=233, kernel_initializer='uniform', activation='selu'))
   model.add(Dense(12, kernel_initializer='uniform', activation='sigmoid'))
   model.compile(SGD(lr=0.008),loss='mean_squared_error', metrics=['accuracy'])
   return model

def addrDataLoading():

   data=pd.read_msgpack('datum.msg')
   data=data.dropna(subset=['s_address','d_address'])
   data=data.sample(300000) # taking a sample of the whole dataset to make the encoding possible
   y=data[['s_address','d_address']]
   x=data.drop(['s_address','d_address'],1)

   encX = be().fit(x, y)
   numeric_X= encX.transform(x)
   encY=be().fit(y,y)
   numeric_Y=encY.transform(y)
   scaler=StandardScaler()
   X_all=scaler.fit_transform(numeric_X)
   x_train=X_all[0:250000,:]
   y_train=numeric_Y.iloc[0:250000,:]
   x_val=X_all[250000:,:]    
   y_val=numeric_Y.iloc[250000:,:]

   return x_train,y_train,x_val,y_val 



x_train,y_train,x_val,y_val=addrDataLoading()

model = model()  # build the network before fitting it
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=20, batch_size=200)

So my question is how to use a custom data generator function to read and process all the data I have and not just a sample, and then use fit_generator() function to train my model?

EDIT

This is a sample of the data: netData

I think that taking different samples from the data results in different encoding dimensions.

For this sample there are 16 different categories: 4 addresses (3 bits), 4 hostnames (3 bits), 1 subnet mask (1 bit), 5 infrastructures (3 bits), 1 access zone (1 bit), so the binary encoding gives us 11 bits and the new dimension of the data is 11, previously 5. Now say that for another sample the address column has 8 different categories; that gives 4 bits in binary, and if we keep the same number of categories in the other columns, the overall encoding results in 12 dimensions. I believe that's what's causing the problem.
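To illustrate (a minimal sketch with made-up column values, assuming only that category_encoders is installed): the width of a binary encoding depends on how many unique categories the encoder sees at fit time, so fitting on different samples can produce different output dimensions.

import pandas as pd
from category_encoders import BinaryEncoder

# 4 unique categories vs. 8 unique categories in a single hypothetical column
small = pd.DataFrame({'address': ['a', 'b', 'c', 'd']})
large = pd.DataFrame({'address': ['a' + str(i) for i in range(8)]})

print(BinaryEncoder().fit_transform(small).shape[1])  # about 3 binary columns
print(BinaryEncoder().fit_transform(large).shape[1])  # about 4 binary columns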

Upvotes: 2

Views: 1693

Answers (1)

Mikhail Stepanov

Reputation: 3790

Slightly slow solution (repeating the same actions)

Edit - fit BinaryEncoder before creating the generators

Drop NAs first and work with the clean data from then on, to avoid reassigning the data frame.

data = pd.read_msgpack('datum.msg')
data.dropna(subset=['s_address','d_address']).to_msgpack('datum_clean.msg')

In this solution data_generator can process the same data multiple times. If that's not critical, you can use this approach.

Define a function which reads the data and splits the index into train and test parts. It won't consume a lot of memory.

import pandas as pd
from category_encoders import BinaryEncoder as be
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np

def model():
   #some code defining the model


def train_test_index_split():
    # if there's enough memory to add one more column
    data = pd.read_msgpack('datum_clean.msg')  # the cleaned file saved above
    train_idx, test_idx = train_test_split(data.index) 
    return data, train_idx, test_idx


data, train_idx, test_idx = train_test_index_split()

Define and initialize the data generator, both for train and validation:

def data_generator(data, encX, encY, batch_size, n_steps, index):
    # EDIT: As the data was cleaned, you don't need dropna
    # data = data.dropna(subset=['s_address','d_address'])
    for i in range(n_steps):
        batch_idx = np.random.choice(index, batch_size)
        sample = data.loc[batch_idx]
        y = sample[['s_address', 'd_address']]
        x = sample.drop(['s_address', 'd_address'], 1)
        numeric_X = encX.transform(x)
        numeric_Y = encY.transform(y)
        scaler = StandardScaler()
        X_all = scaler.fit_transform(numeric_X)
        yield X_all, numeric_Y

The edited part now trains the binary encoders. You should sub-sample your data to create a representative training set for the encoders. I guess the error with the shape of the data was caused by an incorrectly trained BinaryEncoder (Error when checking input: expected dense_9_input to have shape (233,) but got array with shape (234,)):

def get_minimal_unique_frame(df):
    return (pd.Series([df[column].unique() for column in df], index=df.columns)  
           .apply(pd.Series)  # transform each list of unique values into a pd.Series
           .T  # transpose the frame: columns are columns again
           .fillna(method='ffill'))  # fill NaNs with the last valid value

x = get_minimal_unique_frame(data.drop(['s_address', 'd_address'], 1))
y = get_minimal_unique_frame(data[['s_address', 'd_address']])

NB: I never used category_encoders and have an incompatible system configuration, so I can't install and check it. So the former code can cause problems. In that case, I guess, you should compare the lengths of the x and y data frames, make them the same, and probably change the index of the data frames.
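For example, a hypothetical sketch of what that alignment could look like (the reindex/ffill choices here are an assumption, not part of the code above, and assume a default RangeIndex):

# Pad the shorter frame by forward-filling its last row so x and y end up
# with the same length and a shared RangeIndex before fitting the encoders.
n = max(len(x), len(y))
x = x.reindex(range(n)).fillna(method='ffill').reset_index(drop=True)
y = y.reindex(range(n)).fillna(method='ffill').reset_index(drop=True)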

encX = be().fit(x, y)
encY = be().fit(y, y)
batch_size = 200
train_steps = 100000  
val_steps = 5000

train_gen = data_generator(data, encX, encY, batch_size, train_steps, train_idx)
test_gen = data_generator(data, encX, encY, batch_size, val_steps, test_idx)

Edit Please provide an example of x_sample: run train_gen, save the output, and post x_samples and y_samples:

x_samples = []
y_samples = []
for i in range(10):
    x_sample, y_sample = next(train_gen)
    x_samples.append(x_sample)
    y_samples.append(y_sample)

Note: the data generator won't stop itself, but it will be stopped after train_steps by the fit_generator method.

Fit model with generators:

model.fit_generator(generator=train_gen, steps_per_epoch=train_steps, epochs=1,
                    validation_data=test_gen, validation_steps=val_steps)

As far as I know, Python does not copy pandas data frames unless you do it explicitly with copy() or the like. Because of that, both generators use the same object. But if you use a Jupyter Notebook, data leaks / uncollected garbage may occur, and memory troubles come with them.
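A quick illustration of that point, unrelated to the data in this question:

import pandas as pd

df = pd.DataFrame({'a': [1, 2]})
alias = df                 # same object, nothing is copied
print(alias is df)         # True
independent = df.copy()    # an explicit copy is a new object
print(independent is df)   # False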

More efficient solution - sketch

Clean your data

data = pd.read_msgpack('datum.msg')
data.dropna(subset=['s_address','d_address']).to_msgpack('datum_clean.msg')

Create a train/test split, preprocess it, and store it as numpy arrays, if you have enough disk space.

data, train_idx, test_idx = train_test_index_split()

def data_preprocessor(data, path, index):
    # data = data.dropna(subset=['s_address','d_address'])
    sample = data.loc[index]  # use the index passed in (train_idx or test_idx)
    y = sample[['s_address', 'd_address']]
    x = sample.drop(['s_address', 'd_address'], 1)
    encX = be().fit(x, y)
    numeric_X = encX.transform(x)
    encY = be().fit(y, y)
    numeric_Y = encY.transform(y)
    scaler = StandardScaler()
    X_all = scaler.fit_transform(numeric_X)
    np.save(path + '_X', X_all)
    np.save(path + '_y', numeric_Y)

data_preprocessor(data, 'train', train_idx)
data_preprocessor(data, 'test', test_idx)

Delete unnecessary data:

del data

Load your files and use the following generator:

train_X = np.load('train_X.npy')
train_y = np.load('train_y.npy')

test_X = np.load('test_X.npy')
test_y = np.load('test_y.npy')

def data_generator(X, y, batch_size, n_steps):
    idxs = np.arange(len(X))
    np.random.shuffle(idxs)
    ptr = 0

    for _ in range(n_steps):
        batch_idx = idxs[ptr:ptr+batch_size]
        x_sample = X[batch_idx]
        y_sample = y[batch_idx]
        ptr += batch_size
        if ptr >= len(X):  # wrap around so the next slice is never empty
            ptr = 0
        yield x_sample, y_sample

Prepare generators:

train_gen = data_generator(train_X, train_y, batch_size, train_steps)
test_gen = data_generator(test_X, test_y, batch_size, val_steps)

And finally fit the model. Hope one of these solutions helps. At least if Python passes arrays and data frames by reference, not by value: Stackoverflow answer about it.

Upvotes: 1
