Why is the memory usage going up, even though the usage of the single variables is going down?

Question

I am using some data to train a random-forest classifier with scikit-learn. My data has roughly 8,000 data points with a little over 60,000 features. After training, I use clf.feature_importances_ to access the feature importances, sort the features by value, and delete the features whose value is 0. I also delete the remaining feature with the least information. After that I write all remaining features with their respective values into a new file.

This is the point where my recursion starts. I read in the file with the features I want to use, which lacks all the useless information from the previous run. I do not load my dataset again; I only select this subset of features using pandas. Everything is working fine, the variables get smaller and the filtering works as intended, BUT the memory usage goes up with every recursion step, so that I have a usage of roughly 13% after only 10 iterations, up from 4.5% at the beginning.

I already tried the garbage collector, calling gc.collect() before starting the new iteration step. Furthermore, I tried deleting some variables using del and resetting variables to empty lists or plain zeros to avoid variables piling up (which is not the case).

I used this function to check the size of my variables and to confirm that they are indeed going down.

import sys
def sizeof_fmt(num, suffix='B'):
    for unit in ['','Ki','Mi','Gi','Ti','Pi','Ei','Zi']:
        if abs(num) < 1024.0:
            return "%3.1f %s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f %s%s" % (num, 'Yi', suffix)

for name, size in sorted(((name, sys.getsizeof(value)) for name, value in locals().items()),
                         key=lambda x: -x[1])[:10]:
    print("{:>30}: {:>8}".format(name, sizeof_fmt(size)))

The recursion mainly is this, and does not contain the read in of my data:

def rek(rekfile,run):
    import gc
    import sklearn
    from sklearn.model_selection import train_test_split
    from sklearn import metrics
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    ##tried to set all variables to empty lists or zeros which did not work
    cladd = {}
    sorted_cladd ={}
    x_pre = []
    y = 0
    X = 0
    clf = 0
    with open(rekfile) as inf:##tab separated feature value file
        for line in inf:
            spl = line.split('\t')
            ensg = spl[0]
            x_pre.append(ensg)
    y = data_as_pd['label']
    X = data_as_pd[x_pre]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
    #Create a rf Classifier
    clf=RandomForestClassifier(n_estimators=500) 
    #Train the model using the training sets
    clf.fit(X_train, y_train)
    #Predict the response for test dataset
    y_pred = clf.predict(X_test)
    accu = metrics.accuracy_score(y_test, y_pred)
    # Model Accuracy: how often is the classifier correct?
    print("Accuracy:",accu)
    ##only proceeding with the rekursion if accuracy limit is satisfied
    if accu > 0.925:

        featout = "/home/andre/tf/forest/rekursion/rf_feature_values/feature_values_rf_fullmodel_wo_normal_n500_50perc-split_rekursion_run_"+str(run)+".txt" ##where the features for the next runs are saved

        orf = open(featout,'w')
        index = 0
        ##sorting the features for their importance and deleting zeros
        for classi in clf.feature_importances_:
            if classi !=float(0):
                cladd[x_pre[index]] = classi
            else:
                dump.write(x_pre[index]+'\n')
            index = index + 1
        sorted_cladd = sorted(cladd.items(),key=lambda x: x[1], reverse=True)
        ##deleting the feature with the least information
        sorted_cladd.pop()
        for a in range(len(sorted_cladd)):
            orf.write(sorted_cladd[a][0]+'\t'+str(sorted_cladd[a][1])+'\n')
        orf.close()
        ##set run variable
        newrun = run+1
        del clf ##tried to reduce size deleting clf (not working)
        gc.collect() ##tried garbage collector (not working)
        rek(featout,newrun) ##new iteration

I expect the memory usage (RAM) to go down during the iteration process because I reduce the input data with every step, but the used amount actually goes up until the run fails with a MemoryError.

I hope somebody can help me with this because I really do not see what I am missing here. Any help is very much appreciated.

Regards,

André

EDIT: Using a while loop completely worked for me and reduced the used memory as intended!

Nikaido · Accepted Answer

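Python stack frames are comparatively large, and each recursive call keeps its frame, together with every local variable in it, alive until the call returns. In your function, objects such as X, X_train and X_test from every earlier run therefore stay referenced while the deeper calls execute, so gc.collect() cannot free them. I don't know the details of your implementation, but recursion is usually not efficient in terms of memory usage: it can improve readability, but at a significant cost in memory.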
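
To illustrate the point about frames keeping their locals alive, here is a minimal sketch. The names Block, live, recursive and iterative are made up for this example and are not part of the question's code; Block stands in for a large per-run object such as a training matrix. The counts shown rely on CPython's reference counting, so other interpreters may free objects later.

```python
import weakref


class Block:
    """Hypothetical stand-in for a large per-run object (e.g. X_train)."""


# Tracks which Block instances are still alive without keeping them alive.
live = weakref.WeakSet()


def recursive(depth, counts):
    block = Block()           # local of this frame
    live.add(block)
    counts.append(len(live))  # how many blocks exist right now
    if depth > 1:
        # This frame, and therefore `block`, stays alive during the inner call.
        recursive(depth - 1, counts)


def iterative(steps, counts):
    for _ in range(steps):
        block = Block()       # rebinding the name releases the previous block
        live.add(block)
        counts.append(len(live))


rec_counts, loop_counts = [], []
recursive(4, rec_counts)
iterative(4, loop_counts)
print(rec_counts)   # on CPython: [1, 2, 3, 4]
print(loop_counts)  # on CPython: [1, 1, 1, 1]
```

In the recursive version the number of live objects grows with the depth, while the loop never holds more than one at a time.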

Some useful insight on recursion in python here

I would suggest rewriting the code with an iterative approach (a loop instead of the recursive call), which should reduce the memory usage.
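
A rough sketch of what such a loop could look like, under some assumptions: eliminate_features and the fit callback are hypothetical names, not taken from the question. Here fit is assumed to train a model on the given feature names and return the accuracy together with a dict mapping each feature name to its importance (in the question's setting, this would wrap the RandomForestClassifier training and clf.feature_importances_). The 0.925 threshold is the one used in the question.

```python
from typing import Callable, Dict, List, Tuple

# Assumed callback shape: feature names in, (accuracy, importance per name) out.
FitFn = Callable[[List[str]], Tuple[float, Dict[str, float]]]


def eliminate_features(features: List[str], fit: FitFn,
                       min_accuracy: float = 0.925) -> List[str]:
    """Return the last feature set whose accuracy exceeded min_accuracy."""
    current = list(features)
    accepted: List[str] = []
    while current:
        accuracy, importances = fit(current)
        if accuracy <= min_accuracy:
            break  # same stopping rule as the recursive version
        accepted = current
        # Drop zero-importance features and sort the rest, most important first.
        ranked = sorted(
            (name for name in current if importances.get(name, 0.0) != 0.0),
            key=lambda name: importances[name],
            reverse=True,
        )
        # Also drop the least informative remaining feature.
        current = ranked[:-1]
    return accepted
```

Because everything created by one pass is rebound or goes out of scope before the next pass starts, only one run's worth of data is referenced at a time. Writing the feature file for each run could be done inside the loop in the same place where the recursive version wrote it.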
