robertandrei

Reputation: 1

Memory Error when trying to fit a classifier

I am working on a classification task and my training file is a CSV of about 8 GB (approx. 7.2 million lines and 212 columns). My first approach was to load the whole CSV into a pandas DataFrame and use it as a multidimensional array to train my naïve Bayes classifier, but when I tried to fit the data I got a memory error (I am working on a machine with 8 GB of RAM and a 64-bit version of Python).

After that, I tried to split my DataFrame into 5 pieces and use the partial_fit() method, but I still ran out of memory.

This is my code so far (the target values are extracted from a separate txt file):

from csv import DictReader
from sklearn.naive_bayes import MultinomialNB
import numpy
from pandas import *


target_values_train = []

# read the target values from the tab-separated train.txt file
with open('train.txt') as f:
    reader = DictReader(f, delimiter='\t')
    for row in reader:
        target_values_train.append(int(row['human-generated']))

y_train = numpy.asarray(target_values_train)
y_train = y_train[:, numpy.newaxis]

# read the training features in chunks, then concatenate them into one DataFrame
tp = read_csv('train-indices.csv', iterator=True, chunksize=1000, delimiter=';', skiprows=1)
df_train = concat(tp, ignore_index=True)
del df_train['id']
print(df_train)
print(df_train.shape)
print(y_train.shape)

# split the features and the targets into 5 pieces for partial_fit
df1, df2, df3, df4, df5 = numpy.array_split(df_train, 5)
y1, y2, y3, y4, y5 = numpy.array_split(y_train, 5)
print(df1.shape)
print(df2.shape)
print(df3.shape)


clf = MultinomialNB()
clf.partial_fit(df1, y1)
clf.partial_fit(df2, y2)
clf.partial_fit(df3, y3)
clf.partial_fit(df4, y4)
clf.partial_fit(df5, y5)

Any suggestions are very welcome.

Upvotes: 0

Views: 1484

Answers (1)

Mohamed Ali JAMAOUI

Reputation: 14689

By using pd.concat you load all of the chunks back into memory at once, so it is equivalent to reading the whole file in one go.

You need to train by iterating over the chunks one by one. For example, you would do the following:

tp = read_csv('training_data.csv', iterator=True, chunksize=1000, delimiter=';', skiprows=1)
clf = MultinomialNB()
for chunk in tp:
    # train incrementally on each chunk instead of concatenating them first
    clf.partial_fit(chunk[["train_col1", "train_col2", ...]], chunk["y1"])
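
Since in your setup the targets live in a separate train.txt file rather than in the CSV itself, below is a minimal sketch of how that loop could look with your file names. It assumes the rows of train-indices.csv and train.txt are in the same order, and that the labels are only 0 and 1 (the classes argument must be given to partial_fit, at least on the first call):

from csv import DictReader

import numpy
from pandas import read_csv
from sklearn.naive_bayes import MultinomialNB

# load the targets from the separate tab-separated file, as in the question
with open('train.txt') as f:
    reader = DictReader(f, delimiter='\t')
    y_train = numpy.asarray([int(row['human-generated']) for row in reader])

chunksize = 1000
tp = read_csv('train-indices.csv', iterator=True, chunksize=chunksize, delimiter=';', skiprows=1)

clf = MultinomialNB()
for i, chunk in enumerate(tp):
    X = chunk.drop('id', axis=1)                             # drop the id column, as in the question
    y = y_train[i * chunksize : i * chunksize + len(chunk)]  # labels aligned with this chunk
    # classes=[0, 1] is assumed from the binary 'human-generated' target;
    # it is required on the first call to partial_fit
    clf.partial_fit(X, y, classes=[0, 1])

Depending on how train-indices.csv is laid out, the skiprows/header handling may need adjusting, but the important part is that only one chunk is in memory at a time.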

Upvotes: 1
