Reputation: 6369
I am trying to run a sklearn random forest classification on 279,900 instances with 5 attributes and 1 class. But I am getting a memory allocation error (MemoryError) at the fit line when trying to run the classification; it is not able to train the classifier itself. Any suggestions on how to resolve this issue?
The data is:
x, y, day, week, Accuracy
x and y are the coordinates, day is the day of the month (1-30), week is the day of the week (1-7), and Accuracy is an integer.
code:
import csv
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Read the class labels (column 8) from the CSV.
with open("time_data.csv", "rb") as infile:
    re1 = csv.reader(infile)
    result = []
    ##next(reader, None)
    ##for row in reader:
    for row in re1:
        result.append(row[8])

trainclass = result[:251900]
testclass = result[251901:279953]

# Read the five feature columns from the same file.
with open("time_data.csv", "rb") as infile:
    re = csv.reader(infile)
    coords = [(float(d[1]), float(d[2]), float(d[3]), float(d[4]), float(d[5])) for d in re if len(d) > 0]

train = coords[:251900]
test = coords[251901:279953]
print "Done splitting data into test and train data"

clf = RandomForestClassifier(n_estimators=500, max_features="log2", min_samples_split=3, min_samples_leaf=2)
clf.fit(train, trainclass)
print "Done training"
score = clf.score(test, testclass)
print "Done Testing"
print score
Error:
line 366, in fit
builder.build(self.tree_, X, y, sample_weight, X_idx_sorted)
File "sklearn/tree/_tree.pyx", line 145, in sklearn.tree._tree.DepthFirstTreeBuilder.build
File "sklearn/tree/_tree.pyx", line 244, in sklearn.tree._tree.DepthFirstTreeBuilder.build
File "sklearn/tree/_tree.pyx", line 735, in sklearn.tree._tree.Tree._add_node
File "sklearn/tree/_tree.pyx", line 707, in sklearn.tree._tree.Tree._resize_c
File "sklearn/tree/_utils.pyx", line 39, in sklearn.tree._utils.safe_realloc
MemoryError: could not allocate 10206838784 bytes
Upvotes: 3
Views: 7064
Reputation: 11
Please try Google Colaboratory. You can connect to either a local or a hosted runtime. It worked for me with n_estimators=10000.
Upvotes: 1
Reputation: 169
I ran into the same MemoryError recently, but I fixed it by reducing the training data size instead of modifying my model parameters. My OOB value was 0.98, meaning the model is very unlikely to be overfitting.
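A minimal sketch of that approach, reusing the train/trainclass lists from the question (the 50% subsample fraction and the random seed are illustrative, not tuned):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Subsample the training set to cut peak memory during fit;
# oob_score=True gives an out-of-bag accuracy estimate to check for overfitting.
rng = np.random.RandomState(0)
idx = rng.choice(len(train), size=len(train) // 2, replace=False)
train_small = [train[i] for i in idx]
trainclass_small = [trainclass[i] for i in idx]

clf = RandomForestClassifier(n_estimators=500, oob_score=True)
clf.fit(train_small, trainclass_small)
print(clf.oob_score_)  # a value near 1.0 suggests the model is not overfitting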
Upvotes: 0
Reputation: 793
From the scikit-learn docs: "The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values."
I would try to adjust those parameters, then, as in the sketch below. You can also try a memory profiler, or run it on Google Colaboratory if your machine has too little RAM.
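A hedged sketch with those size-limiting parameters set explicitly (all values below are illustrative, not tuned for this data set):

from sklearn.ensemble import RandomForestClassifier

# Shallower trees and larger leaves mean fewer nodes per tree, which is
# exactly the array that failed to allocate in the traceback.
clf = RandomForestClassifier(
    n_estimators=100,      # fewer trees than the 500 in the question
    max_depth=20,          # the default (None) grows trees until leaves are pure
    min_samples_split=10,  # require more samples before splitting a node
    min_samples_leaf=5,    # larger leaves mean smaller trees
    max_features="log2",
)
clf.fit(train, trainclass)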
Upvotes: 1