Dahlai

Reputation: 705

understanding scikit learn Random Forest memory requirement for prediction

I have a set of 2000 trained random regression trees (from scikit-learn's Random Forest Regressor with n_estimators=1). Training the trees in parallel (50 cores) on a large dataset (~100000*700000 = 70 GB @ 8-bit) using multiprocessing and shared memory works like a charm. Note that I am not using RF's built-in multicore support, since I do feature selection beforehand.

The problem: when testing a large matrix (~20000*700000) in parallel I always run out of memory (I have access to a server with 500 GB of RAM).

My strategy is to have the test matrix in memory and share it among all processes. According to a statement by one of the developers the memory requirement for testing is 2*n_jobs*sizeof(X), and in my case another factor of *4 is relevant, since the 8bit matrix entries are upcast to float32 internally in RF.

By my numbers, I think testing should need:
14 GB to hold the test matrix in memory + 50 (=n_jobs) * 20000 (=n_samples) * 700 (=n_features per tree) * 4 (upcast to float32) * 2 bytes = 14 GB + 5.6 GB ≈ 20 GB of memory.
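Spelling the estimate out as a quick sanity check (numbers taken from the question; the 700 features per tree come from the feature selection step):

```python
n_jobs = 50
n_samples = 20000          # rows of the test matrix
n_features_full = 700000   # columns of the shared 8-bit test matrix
n_features_tree = 700      # features each individual tree actually uses

bytes_shared = n_samples * n_features_full                     # 1 byte per entry
bytes_workers = 2 * n_jobs * n_samples * n_features_tree * 4   # float32 upcast
print((bytes_shared + bytes_workers) / 1e9)  # 19.6 -> roughly 20 GB expected
```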

Yet memory usage always blows up to several hundred GB. What am I missing here? (I am on the newest version of scikit-learn, so the old memory issues should be ironed out.)

An observation:
Running on one core only, memory usage during testing fluctuates between 30 and 100 GB (as measured by free).

My code:

#----------------
#helper functions
def initializeRFtest(*args):
    global df_test, pt_test #initialize test data and test labels as globals in shared memory
    df_test, pt_test = args


def star_testTree(model_featidx):
    return predTree(*model_featidx)

#end of helper functions
#-------------------

def RFtest(models, df_test, pt_test, features_idx, no_trees):
    #test trees in parallel
    ncores = 50
    p = Pool(ncores, initializer=initializeRFtest, initargs=(df_test, pt_test))
    args = itertools.izip(models, features_idx)
    out_list = p.map(star_testTree, args)
    p.close()
    p.join()
    return out_list

def predTree(model, feat_idx):
    #get all indices of samples that meet feature subset requirement
    nan_rows = np.unique(np.where(df_test.iloc[:,feat_idx] == settings.nan_enc)[0])
    all_rows = np.arange(df_test.shape[0])
    rows = all_rows[np.invert(np.in1d(all_rows, nan_rows))]    #discard rows with missing values in the given features

    #predict
    pred = model.predict(df_test.iloc[rows,feat_idx])
    return pred

#main program
out = RFtest(models, df_test, pt_test, features_idx, no_trees)

Edit: another observation: when chunking the test data, the program runs smoothly with much reduced memory usage. This is the workaround I ended up using.
Code snippet for the updated predTree function:

def predTree(model, feat_idx):
    # get all indices of samples that meet feature subset requirement
    nan_rows = np.unique(np.where(df_test.iloc[:,feat_idx] == settings.nan_enc)[0])
    all_rows = np.arange(df_test.shape[0])
    rows = all_rows[np.invert(np.in1d(all_rows, nan_rows))]    #discard rows with missing values in the given features

    # predict chunksize rows at a time, so only one small float32 copy
    # of the test data exists per process at any moment
    chunksize = 500
    n_chunks = int(math.ceil(float(rows.shape[0]) / chunksize))

    pred = []
    for i in range(n_chunks):
        chunk = rows[i*chunksize:(i+1)*chunksize]    #slicing past the end is safe
        pred.append(model.predict(df_test.iloc[chunk, feat_idx]))
    pred = np.concatenate(pred)

    # populate output vector, NaN for the rows that were skipped
    predicted = np.empty(df_test.shape[0])
    predicted.fill(np.nan)
    predicted[rows] = pred
    return predicted
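The chunking trick above generalizes beyond this script. A minimal self-contained sketch of the same pattern (MeanModel is a stand-in for a fitted tree, since the real regressors are not shown here):

```python
import numpy as np

class MeanModel(object):
    # stand-in for a fitted regressor: predicts the per-row mean
    def predict(self, X):
        return np.asarray(X, dtype=np.float32).mean(axis=1)

def predict_in_chunks(model, X, chunksize=500):
    # only one chunksize-row float32 copy of X is alive at a time
    parts = [model.predict(X[start:start + chunksize])
             for start in range(0, X.shape[0], chunksize)]
    return np.concatenate(parts)

X = np.arange(12, dtype=np.uint8).reshape(4, 3)
print(predict_in_chunks(MeanModel(), X, chunksize=2))  # [ 1.  4.  7. 10.]
```

Because slicing past the end of an array is safe in NumPy, no special last-chunk branch is needed.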

Upvotes: 9

Views: 2296

Answers (1)

sophros

Reputation: 16758

The memory issue may be related to the use of itertools.izip in args = itertools.izip(models, features_idx), which may trigger creation of copies of the iterator, along with its arguments, across all processes. Have you tried plain zip?
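A quick way to test that hypothesis is to materialize the pairing eagerly before handing it to Pool.map (the stand-in values below are illustrative; on Python 2, zip already returns a list, and the list() wrapper is only needed on Python 3):

```python
models = ['m1', 'm2', 'm3']             # stand-ins for the fitted trees
features_idx = [[0, 5], [1, 7], [2, 9]]
args = list(zip(models, features_idx))  # eager list instead of a lazy izip
print(args[0])  # ('m1', [0, 5])
```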

Another hypothesis is inefficient garbage collection that is not triggered when you need it. I would check whether running gc.collect() just before model.predict in predTree helps.
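A sketch of that suggestion (IdentitySum is a toy stand-in for a fitted tree; the real predTree would call gc.collect() in the same spot, right before the memory-hungry call):

```python
import gc

class IdentitySum(object):
    # toy stand-in for a fitted model
    def predict(self, X):
        return [sum(row) for row in X]

def predict_with_gc(model, X):
    gc.collect()              # free unreachable garbage before the big call
    return model.predict(X)

print(predict_with_gc(IdentitySum(), [[1, 2], [3, 4]]))  # [3, 7]
```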

There is also a third potential reason (probably the most credible). Let me cite the Python FAQ entry "How does Python manage memory?":

In current releases of CPython, each new assignment to x inside the loop will release the previously allocated resource.

In your chunked function you do precisely that: repeatedly assign to pred_chunked.

Upvotes: 1
