dus

Reputation: 43

Running out of memory while training machine learning model

I have limited memory and training this model is using too much of it:

import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np



clf = RandomForestClassifier(n_estimators=10)
print("Created Random Forest classifier\n")

data = pd.read_csv("House_2_ALL.csv")
print("Finished reading data\n")

data = data.drop("UnixTimeStamp", axis=1)  # drop() returns a new DataFrame, so assign it back
predict = "Aggregate_Power"
print("Dropped UnixTimeStamp\n")

X = np.array(data.drop(columns=[predict]))
Y = np.array(data[predict])
print("Created numpy Arrays\n")

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1)
print("Assigned Testing/Training Variables\n")

clf.fit(X_train, Y_train)
print("Fit model\n")

print("Attempting to predict\n")
print(clf.predict(X_test))

When I run this program, my computer states that it has run out of memory and that I need to quit some applications. Any ideas on how to manage memory better, or is the only solution to reduce the size of my training dataset?

I have learned that the program runs smoothly until it gets to the "clf.fit(X_train, Y_train)" line, so I don't know if this is a problem with pandas' memory-hungry dataframes or with sklearn.
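
In case it matters, this is roughly how I could check how much memory the DataFrame itself takes up before the fit (pandas can report its own footprint):

import pandas as pd

data = pd.read_csv("House_2_ALL.csv")

# deep=True also counts the Python objects inside object/string columns
data.info(memory_usage="deep")
print(data.memory_usage(deep=True).sum(), "bytes")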

Upvotes: 1

Views: 6220

Answers (2)

Priya

Reputation: 743

There are two possible scenarios here that could cause a memory error.

1. pandas.read_csv() with chunksize

You could use the chunksize parameter to load the data one smaller chunk at a time (it returns an object we can iterate over).

chunk_size = 50000
reader = pd.read_csv('big_file.csv', chunksize=chunk_size)
for data_chunk in reader:   # each iteration yields a DataFrame of at most chunk_size rows
    # process the chunk here
    ...

2. Random Forest Classifier/Regressor

It has the default parameters max_depth=None and min_samples_leaf=1, which means full trees are grown. If the dataset is large, the RandomForest grows very deep trees with many nodes, which quickly drives up memory consumption.

Suppose you fit with the defaults:

clf = RandomForestClassifier()
clf.fit(X_train, y_train)

Then you could check a few things, like:

import joblib

print(clf.estimators_[0].tree_.max_depth)                 # depth the first tree actually grew to (e.g. when fit on one chunk of data)
joblib.dump(clf.estimators_[0], "first_tree_clf.joblib")  # dump one tree to see how big it is on disk

Now you can set an explicit value for the max_depth hyperparameter and fit the model again. Tuning the RandomForest hyperparameters this way keeps the trees shallow for each chunk and avoids excessive memory consumption.
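
For example, a rough sketch (the exact numbers are just starting points you would tune, and it assumes the X_train/Y_train split from the question):

from sklearn.ensemble import RandomForestClassifier

# Capping depth and leaf size keeps every tree small, which bounds memory use.
clf = RandomForestClassifier(
    n_estimators=10,
    max_depth=12,         # trees stop growing at this depth
    min_samples_leaf=5,   # stop splitting once a leaf gets this small
)
clf.fit(X_train, Y_train)

print(clf.estimators_[0].tree_.max_depth)  # will now be at most 12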

Upvotes: 0

Abhishek Prajapat

Reputation: 1888

In my opinion, your dataset is quite large, so you should load it in parts to train your model. I will share an example:

import os
import joblib
import pandas as pd
from sklearn.model_selection import train_test_split

reader = pd.read_csv(dataset_path, chunksize=10000)
# This will load only 10000 rows at a time (you can tune this for your RAM)

# reader is an iterator over DataFrames, so you can do something like this
for part_df in reader:
    # Treat each "part_df" as your original df: do all the preprocessing
    # on it and train the model on it. After training on this part_df you
    # save the model and reload it in the next iteration.
    part_df = preprocess_df(part_df)                    # some preprocessing function
    xtrain, xvalid, ytrain, yvalid = train_test_split(  # some split into features/labels
        part_df.drop(columns=["Aggregate_Power"]), part_df["Aggregate_Power"])

    if os.path.exists(model_path):       # you won't have a model for the first iteration
        model = joblib.load(model_path)  # here you load the model
    else:
        model = ...                      # define the model for the first chunk

    model.fit(xtrain, ytrain)            # train the model

    joblib.dump(model, model_path)       # now you save the model for the next iteration

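Note that RandomForest has no partial_fit, so reloading the saved model and calling fit on the next chunk replaces the old trees rather than adding to them. If you want the forest to actually accumulate trees across chunks, one option is warm_start=True. A minimal sketch (reusing the column names from the question; the chunk size and tree counts are placeholders):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# warm_start=True keeps the trees from previous fit() calls and only adds new ones.
model = RandomForestClassifier(n_estimators=10, warm_start=True)

first_chunk = True
for part_df in pd.read_csv("House_2_ALL.csv", chunksize=10000):
    X_part = part_df.drop(columns=["UnixTimeStamp", "Aggregate_Power"])
    y_part = part_df["Aggregate_Power"]
    if not first_chunk:
        model.n_estimators += 10   # grow 10 more trees on each new chunk
    model.fit(X_part, y_part)      # previously fitted trees are kept
    first_chunk = False
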
Upvotes: 2
