Reputation: 23
I am training a random-forest classifier with scikit-learn. My data has about 8,000 data points with a little over 60,000 features. After fitting on the training set, I use clf.feature_importances_ to get the feature importances, sort them by value, and delete the features whose importance is 0. I also delete the least informative of the remaining features. I then write all remaining features with their respective values to a new file. This is the point where my recursion starts: I read in the file with the features I want to use, which lacks all the useless features from the previous run. I do not load my dataset again; I only select this subset of features from it using pandas. Everything works fine, the variables get reduced and the filtering works as intended, BUT the memory usage goes up with every recursion step, so that I reach roughly 13% after only 10 iterations, up from 4.5% at the beginning.
I already tried calling the garbage collector with gc.collect() before starting the next iteration step. Furthermore, I tried deleting some variables with del, and resetting variables to empty lists or plain zeros to avoid variables piling up (which is not the case).
I used this function to determine the size of my variables and to check that they are indeed going down.
import sys
def sizeof_fmt(num, suffix='B'):
    for unit in ['','Ki','Mi','Gi','Ti','Pi','Ei','Zi']:
        if abs(num) < 1024.0:
            return "%3.1f %s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f %s%s" % (num, 'Yi', suffix)
for name, size in sorted(((name, sys.getsizeof(value)) for name, value in locals().items()),
                         key= lambda x: -x[1])[:10]:
    print("{:>30}: {:>8}".format(name, sizeof_fmt(size)))
The recursive function is essentially the following; it does not include reading in my data:
def rek(rekfile,run):
    import gc
    import sklearn
    from sklearn.model_selection import train_test_split
    from sklearn import metrics
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    ##tried to set all variables to empty lists or zeros which did not work
    cladd = {}
    sorted_cladd ={}
    x_pre = []
    y = 0
    X = 0
    clf = 0
    with open(rekfile) as inf:##tab seperated feature value file
        for line in inf:
            spl = line.split('\t')
            ensg = spl[0]
            x_pre.append(ensg)
    y = data_as_pd['label']
    X = data_as_pd[x_pre]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
    #Create a rf Classifier
    clf=RandomForestClassifier(n_estimators=500)
    #Train the model using the training sets
    clf.fit(X_train, y_train)
    #Predict the response for test dataset
    y_pred = clf.predict(X_test)
    accu = metrics.accuracy_score(y_test, y_pred)
    # Model Accuracy: how often is the classifier correct?
    print("Accuracy:",accu)
    ##only proceeding with the rekursion if accuracy limit is satisfied
    if accu > 0.925:
        featout = "/home/andre/tf/forest/rekursion/rf_feature_values/feature_values_rf_fullmodel_wo_normal_n500_50perc-split_rekursion_run_"+str(run)+".txt" ##where the features for the next runs are saved
        orf = open(featout,'w')
        index = 0
        ##sorting the features for their importance and deleting zeros
        for classi in clf.feature_importances_:
            if classi !=float(0):
                cladd[x_pre[index]] = classi
            else:
                dump.write(x_pre[index]+'\n')
            index = index + 1
        sorted_cladd = sorted(cladd.items(),key=lambda x: x[1], reverse=True)
        ##deleting the feature with the least information
        sorted_cladd.pop()[-1]
        for a in range(len(sorted_cladd)):
            orf.write(sorted_cladd[a][0]+'\t'+str(sorted_cladd[a][1])+'\n')
        orf.close()
        ##set run variable
        newrun = run+1
        del clf ##tried to reduce size deleting clf (not working)
        gc.collect() ##tried garbage collector (not working)
        rek(featout,newrun) ##new iteration
I expect the memory usage (RAM) to go down during the iteration process because I reduce the input data with every step, but the used amount actually goes up until the "Memory Error" error message.
I hope somebody can help me with this because I really do not see what I am missing here. Any help is very much appreciated.
Regards,
André
EDIT: Using a while loop completely worked for me and reduced the used memory as intended!
Upvotes: 2
Views: 192
Reputation: 4629
Every active call in a recursion keeps its stack frame, and therefore all of its local variables, alive until it returns. I don't know the details of your implementation, but in Python recursion is usually inefficient in terms of memory usage: it can improve readability, but at a significant cost in memory.
Some useful insight on recursion in Python here
I would suggest rewriting the code with an iterative approach, which should reduce the memory usage.
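To illustrate the difference, here is a minimal, self-contained sketch (not your actual code). It allocates one buffer per step, once recursively and once in a loop, and compares the peak allocation reported by the standard-library tracemalloc module. The names recursive, iterative, peak_bytes and the CHUNK size are made up for this illustration.

```python
import tracemalloc

CHUNK = 1_000_000  # bytes allocated per step (arbitrary illustrative size)


def recursive(steps):
    """Each level allocates a buffer that its own frame keeps referenced."""
    data = bytearray(CHUNK)  # local variable, alive until this call returns
    if steps > 1:
        recursive(steps - 1)  # deeper calls run while `data` still exists
    return len(data)


def iterative(steps):
    """Same work in a loop: rebinding `data` releases the previous buffer."""
    data = bytearray(0)
    for _ in range(steps):
        data = bytearray(CHUNK)
    return len(data)


def peak_bytes(func, steps):
    """Return the peak traced allocation (in bytes) while running func(steps)."""
    tracemalloc.start()
    func(steps)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


if __name__ == "__main__":
    print("recursive peak:", peak_bytes(recursive, 10))
    print("iterative peak:", peak_bytes(iterative, 10))
```

In the recursive version the peak grows with the number of steps, because every level still holds its buffer when the next one allocates. In the loop, at most the old and the new buffer exist at the same time. Calling gc.collect() would not change this, since the buffers held by outer frames are still referenced.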
Upvotes: 1
Reputation: 364
I think a first way towards improvement is to refactor your code so that it is easier for the garbage collector to track down variables' going out of scope. That being said, and as mentioned by @Nikaidoh, recursion is not great in Python, and should be avoided when possible. In your case, it is actually rather easy to do the same thing with a for loop, which can even improve readability imho.
Here is a rewrite of your code, with both the refactoring and an alternative to the recursive implementation. I would advise going with the latter (thus removing the export_features_list and rek functions), and possibly modifying fit_clf to return clf.feature_importances_ rather than clf itself (thus removing the del clf from main; you would have to test whether an explicit gc.collect() call is still worth retaining), as indicated in the comments.
import gc
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
def get_predictors_list(rekfile):
    """Read the feature names (first column) from a tab-separated file."""
    with open(rekfile) as inf:  # tab-separated feature value file
        x_pre = [
            line.split('\t', 1)[0]
            for line in inf
        ]
    return x_pre
def fit_clf(x_pre):
    # NOTE: data_as_pd is the DataFrame loaded once at module level
    # (its loading is not shown in the question).
    y = data_as_pd['label']
    X = data_as_pd[x_pre]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
    # Create a rf Classifier
    clf=RandomForestClassifier(n_estimators=500)
    # Train the model using the training sets
    clf.fit(X_train, y_train)
    # Predict the response for test dataset
    y_pred = clf.predict(X_test)
    accu = metrics.accuracy_score(y_test, y_pred)
    # Model Accuracy: how often is the classifier correct?
    print("Accuracy:", accu)
    # Return the trained model and its accuracy.
    return clf, accu
def export_features_list(run, clf, x_pre):
    featout = "/home/andre/tf/forest/rekursion/rf_feature_values/feature_values_rf_fullmodel_wo_normal_n500_50perc-split_rekursion_run_"+str(run)+".txt" ##where the features for the next runs are saved
    ## sorting the features by importance and setting zeros aside
    cladd = {}
    for name, classi in zip(x_pre, clf.feature_importances_):
        if classi != 0:
            cladd[name] = classi
        else:
            # NOTE: `dump` is not defined in the question's code; it is
            # assumed to be a writable file handle opened at module level.
            dump.write(name+'\n')
    sorted_cladd = sorted(cladd.items(), key=lambda x: x[1], reverse=True)
    ## deleting the feature with the least information
    sorted_cladd.pop()
    with open(featout, 'w') as orf:
        for name, classi in sorted_cladd:
            orf.write(name+'\t'+str(classi)+'\n')
    ## return the path to the kept features' file
    return featout
# Recursive way - preferably avoid this as recursion is not that great in Python.
def rek(rekfile, run):
    x_pre = get_predictors_list(rekfile)
    clf, accu = fit_clf(x_pre)
    ## only proceeding with the rekursion if accuracy limit is satisfied
    if accu > 0.925:
        featout = export_features_list(run, clf, x_pre)
        gc.collect()  # hopefully useless, but who knows?
        rek(featout, run + 1)
# Alternative way, with a for loop.
def main(features_file, accu_thresh=0.925, max_runs=20):
    """Train RandomForest classifiers, iteratively removing features.

    Stop removing features when it would result in an accuracy below
    `accu_thresh`, or when `max_runs` features have been removed.
    Return the list of kept features, as well as the list of dropped
    ones, in removal order.
    """
    features = get_predictors_list(features_file)
    dropped = []
    for run in range(max_runs):
        print('Run %i' % run)
        # Fit a model and evaluate its accuracy.
        # FIXME: we could also return features importance only!
        clf, accuracy = fit_clf(features)
        # If the accuracy is too low, stop the process.
        if accuracy <= accu_thresh:
            break
        # Otherwise, drop the least important feature.
        dropped.append(features.pop(int(np.argmin(clf.feature_importances_))))
        print('Dropped feature %s.' % dropped[-1])
        # Explicitly delete the model and garbage collect, for safety.
        del clf
        gc.collect()
    # Return selected features, and the list of dropped ones.
    return features, dropped
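As a rough sketch of the variant mentioned in the FIXME comment, the loop can be written so that it never sees the model at all, only an accuracy and a sequence of importances. The function eliminate_features and its fit_and_score parameter are hypothetical names introduced here, not part of scikit-learn. Unlike main above, this version also drops zero-importance features in each run, as the original recursive code did.

```python
def eliminate_features(features, fit_and_score, accu_thresh=0.925, max_runs=20):
    """Iteratively drop uninformative features while accuracy stays high.

    `fit_and_score(features)` is a caller-supplied (hypothetical) callable
    returning `(accuracy, importances)`, with one importance per feature,
    in the same order as `features`.
    """
    features = list(features)  # work on a copy, leave the input untouched
    dropped = []
    for run in range(max_runs):
        if not features:
            break
        accuracy, importances = fit_and_score(features)
        # If the accuracy is too low, stop the process.
        if accuracy <= accu_thresh:
            break
        # Indices of the features with non-zero importance.
        nonzero = [i for i, imp in enumerate(importances) if imp != 0]
        # Among those, also discard the least important one.
        if nonzero:
            nonzero.remove(min(nonzero, key=lambda i: importances[i]))
        kept = set(nonzero)
        dropped.extend(f for i, f in enumerate(features) if i not in kept)
        features = [f for i, f in enumerate(features) if i in kept]
    return features, dropped
```

With scikit-learn, fit_and_score could be a small function that fits a RandomForestClassifier on data_as_pd[features] and returns metrics.accuracy_score(y_test, y_pred) together with clf.feature_importances_. Since clf is then local to that function, it goes out of scope on every return, and no explicit del clf is needed in the loop.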
I hope this helps. Best regards, Paul
Upvotes: 0