robertandrei

Reputation: 1

Memory Error when trying to fit a classifier

I am working on a classification task and my training file is a CSV of about 8 GB (approx. 7.2 million lines and 212 columns). My first approach was to load the whole CSV into a pandas DataFrame and use it as a multidimensional array to train my naïve Bayes classifier, but when I tried to fit the data I got a memory error (I am working on a machine with 8 GB of RAM and a 64-bit version of Python).

After that, I tried to split my DataFrame into 5 pieces and use the partial_fit() method, but I still ran out of memory.

This is my code so far (the target values are extracted from a separate txt file):

from csv import DictReader
from sklearn.naive_bayes import MultinomialNB
import numpy
from pandas import *


target_values_train = []

# read the target values from the tab-separated train.txt file
with open('train.txt') as f:
    reader = DictReader(f, delimiter='\t')
    for row in reader:
        target_values_train.append(int(row['human-generated']))

y_train = numpy.asarray(target_values_train)
y_train = y_train[:, numpy.newaxis]

# read the training features in chunks, then concatenate them into one DataFrame
tp = read_csv('train-indices.csv', iterator=True, chunksize=1000, delimiter=';', skiprows=1)
df_train = concat(tp, ignore_index=True)
del df_train['id']
print(df_train)
print(df_train.shape)
print(y_train.shape)

# split the features and the targets into 5 pieces for partial_fit
df1, df2, df3, df4, df5 = numpy.array_split(df_train, 5)
y1, y2, y3, y4, y5 = numpy.array_split(y_train, 5)
print(df1.shape)
print(df2.shape)
print(df3.shape)


clf = MultinomialNB()
clf.partial_fit(df1, y1)
clf.partial_fit(df2, y2)
clf.partial_fit(df3, y3)
clf.partial_fit(df4, y4)
clf.partial_fit(df5, y5)

Any suggestions are very welcome.

Upvotes: 0

Views: 1484

Answers (1)

Mohamed Ali JAMAOUI

Reputation: 14689

By using pd.concat you load all of the chunks back into memory at once, so it is equivalent to reading the whole file in one go.

You need to train by iterating over the chunks one by one. For example, you would do the following:

tp = read_csv('training_data.csv', iterator=True, chunksize=1000, delimiter=';', skiprows=1)
clf = MultinomialNB()
for chunk in tp:
    # train incrementally on each chunk instead of concatenating them first
    clf.partial_fit(chunk[["train_col1", "train_col2", ...]], chunk["y1"])
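
Since in your setup the targets live in a separate train.txt file rather than in the CSV itself, below is a minimal sketch of how that loop could look with your file names. It assumes the rows of train-indices.csv and train.txt are in the same order, and that the labels are only 0 and 1 (the classes argument must be given to partial_fit, at least on the first call):

from csv import DictReader

import numpy
from pandas import read_csv
from sklearn.naive_bayes import MultinomialNB

# load the targets from the separate tab-separated file, as in the question
with open('train.txt') as f:
    reader = DictReader(f, delimiter='\t')
    y_train = numpy.asarray([int(row['human-generated']) for row in reader])

chunksize = 1000
tp = read_csv('train-indices.csv', iterator=True, chunksize=chunksize, delimiter=';', skiprows=1)

clf = MultinomialNB()
for i, chunk in enumerate(tp):
    X = chunk.drop('id', axis=1)                             # drop the id column, as in the question
    y = y_train[i * chunksize : i * chunksize + len(chunk)]  # labels aligned with this chunk
    # classes=[0, 1] is assumed from the binary 'human-generated' target;
    # it is required on the first call to partial_fit
    clf.partial_fit(X, y, classes=[0, 1])

Depending on how train-indices.csv is laid out, the skiprows/header handling may need adjusting, but the important part is that only one chunk is in memory at a time.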

Upvotes: 1
