Reputation: 43
I have limited memory, and training this model is taking too much of it:
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
clf = RandomForestClassifier(n_estimators=10)
print("Created Random Forest classifier\n")
data = pd.read_csv("House_2_ALL.csv")
print("Finished reading data\n")
data.drop("UnixTimeStamp",1)
predict = "Aggregate_Power"
print("Dropped UnixTimeStamp\n")
X = np.array(data.drop([predict], axis=1))
Y = np.array(data[predict])
print("Created numpy Arrays\n")
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1)
print("Assigned Testing/Training Variables\n")
clf.fit(X_train, Y_train)
print("Fit model\n")
print("Attempting to predict\n")
print(clf.predict(X_test))
When I run this program, my computer states that it has run out of memory and that I need to quit some applications. Any ideas on how to manage memory better or is the only solution to reduce the size of my training dataset?
I have learned that the program runs smoothly until it gets to the clf.fit(X_train, Y_train) line, so I don't know if this is a problem with pandas' memory-hungry DataFrames or with sklearn.
Upvotes: 1
Views: 6220
Reputation: 743
There are two possible scenarios here that could cause a memory error.
1. pandas.read_csv() with chunksize
You could use the chunksize parameter and load the data a smaller chunk at a time (read_csv then returns an object we can iterate over).
chunk_size = 50000
num = 3  # number of chunks to process (tune for your memory)
reader = pd.read_csv('big_file.csv', chunksize=chunk_size)
for i in range(num):
    data_chunk = next(reader)
    # process chunk
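Applied to the data from the question, that could look roughly like this (a sketch only: the 50,000-row chunk size is an assumption to tune for your RAM, and it trains on just the first chunk):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

reader = pd.read_csv("House_2_ALL.csv", chunksize=50000)
chunk = next(reader)  # only the first 50,000 rows are loaded into memory

chunk = chunk.drop(columns=["UnixTimeStamp"])
X = chunk.drop(columns=["Aggregate_Power"]).to_numpy()
Y = chunk["Aggregate_Power"].to_numpy()

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1)
clf = RandomForestClassifier(n_estimators=10)
clf.fit(X_train, Y_train)
print(clf.predict(X_test))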
2. Random Forest Classifier/Regressor
It has the default parameters max_depth=None and min_samples_leaf=1, which means full trees are grown. If the dataset is large, the random forest grows very deep trees with many nodes, leading to rapid memory consumption.
Let clf = RandomForestClassifier() and clf.fit(X_train, y_train); then you could check a few things like:

import joblib

print(clf.estimators_[0].tree_.max_depth)  # max_depth actually reached on a chunk of data
joblib.dump(clf.estimators_[0], "first_tree_clf.joblib")  # dump one tree to check its size on disk
Now you can try a fixed value for the max_depth hyperparameter and fit the model again. Tuning the RandomForestClassifier hyperparameters this way creates shallower trees per chunk and avoids excessive memory consumption.
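For example, a minimal sketch along those lines, reusing X_train and Y_train from the question (the specific values below are illustrative assumptions, not tuned settings):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=10,
    max_depth=15,        # cap tree depth instead of growing full trees
    min_samples_leaf=5,  # larger leaves mean fewer nodes per tree
)
clf.fit(X_train, Y_train)
print(clf.estimators_[0].tree_.max_depth)  # should now be at most 15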
Upvotes: 0
Reputation: 1888
In my opinion, your dataset is quite large, so you should load it in parts to train your model. I will share an example:
df = pd.read_csv(dataset_path, chunksize=10000)
# This will load only 10000 rows at a time (you can tune this for your RAM)
# Now df is an iterator, so you can do something like this
for part_df in df:
    '''
    Here you just treat "part_df" as your original df and do all the
    preprocessing on it and train the model on it. After training
    the model on this part_df you save the model and reload it in the next iteration.
    '''
    part_df = preprocess_df(part_df)  # some preprocessing function
    xtrain, xvalid, ytrain, yvalid = train_test_split(part_df)  # some split
    if os.path.exists(model_path):  # you won't have a model for the first iteration
        model = ...  # here you load the saved model
    else:
        model = ...  # define the model for the first iteration of df
    model.fit(...)  # train the model
    # Now you save the model for the next iteration
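To make that pattern concrete (this is a sketch, not the exact code above: it assumes a random forest grown with warm_start so each chunk adds trees instead of refitting from scratch, assumes every chunk contains the same set of target classes, and the model path and chunk size are illustrative):

import os
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

model_path = "rf_model.joblib"  # hypothetical path for the saved model

for part_df in pd.read_csv("House_2_ALL.csv", chunksize=10000):
    part_df = part_df.drop(columns=["UnixTimeStamp"])
    X = part_df.drop(columns=["Aggregate_Power"]).to_numpy()
    Y = part_df["Aggregate_Power"].to_numpy()
    X_train, X_valid, Y_train, Y_valid = train_test_split(X, Y, test_size=0.1)

    if os.path.exists(model_path):
        model = joblib.load(model_path)  # reload the model trained on previous chunks
        model.n_estimators += 10         # grow 10 more trees on this chunk
    else:
        model = RandomForestClassifier(n_estimators=10, warm_start=True)

    model.fit(X_train, Y_train)
    joblib.dump(model, model_path)       # save for the next iteration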
Upvotes: 2