Reputation: 705
I have a set of 2000 trained random regression trees (from scikit-learn's RandomForestRegressor with n_estimators=1). Training the trees in parallel (50 cores) on a large dataset (~100000*700000, i.e. ~70 GB at 8 bits per entry) using multiprocessing and shared memory works like a charm. Note that I am not using the random forest's built-in multicore support, since I do feature selection beforehand.
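For context, a minimal sketch of the training setup described above; train_df, train_labels and feature_subsets are placeholder names for illustration, not my actual code:

from multiprocessing import Pool
from sklearn.ensemble import RandomForestRegressor

def train_single_tree(feat_idx):
    # train_df / train_labels are assumed to live in shared memory,
    # analogous to the testing code below
    model = RandomForestRegressor(n_estimators=1)  # one tree per model
    model.fit(train_df.iloc[:, feat_idx], train_labels)
    return model

# models = Pool(50).map(train_single_tree, feature_subsets)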
The problem: when testing on a large matrix (~20000*700000) in parallel, I always run out of memory (I have access to a server with 500 GB of RAM).
My strategy is to keep the test matrix in memory and share it among all processes. According to a statement by one of the developers, the memory requirement for testing is 2*n_jobs*sizeof(X), and in my case another factor of 4 applies, since the 8-bit matrix entries are internally upcast to float32 in the random forest.
By the numbers, I think testing should need:
14 GB to hold the test matrix in memory + 50 (=n_jobs) * 20000 (=n_samples) * 700 (=n_features) * 4 (upcast to float32) * 2 bytes = 14 GB + 5.6 GB ≈ 20 GB of memory.
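The same estimate as plain arithmetic, using only the numbers above:

test_matrix = 20000 * 700000 * 1   # 8-bit entries -> 14 GB
per_job = 20000 * 700 * 4 * 2      # float32 upcast, factor 2 from the developer statement
total = test_matrix + 50 * per_job # n_jobs = 50
print(total / 1e9)                 # -> 19.6 (GB)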
Yet it always blows up to several hundred GB. What am I missing here? (I am on the newest version of scikit-learn, so the old memory issues should be ironed out.)
An observation: running on one core only, memory usage for testing fluctuates between 30 and 100 GB (as measured by free).
My code:
import itertools
import math
from multiprocessing import Pool
import numpy as np
import settings  # own module holding the missing-value encoding (nan_enc)

#----------------
# helper functions
def initializeRFtest(*args):
    # initialize test data and test labels as globals in shared memory
    global df_test, pt_test
    df_test, pt_test = args

def star_testTree(model_featidx):
    return predTree(*model_featidx)
# end of helper functions
#-------------------

def RFtest(models, df_test, pt_test, features_idx, no_trees):
    # test trees in parallel
    ncores = 50
    p = Pool(ncores, initializer=initializeRFtest, initargs=(df_test, pt_test))
    args = itertools.izip(models, features_idx)
    out_list = p.map(star_testTree, args)
    p.close()
    p.join()
    return out_list

def predTree(model, feat_idx):
    # get all indices of samples that meet the feature subset requirement
    nan_rows = np.unique(np.where(df_test.iloc[:, feat_idx] == settings.nan_enc)[0])
    all_rows = np.arange(df_test.shape[0])
    rows = all_rows[np.invert(np.in1d(all_rows, nan_rows))]  # discard rows with missing values in the given features
    # predict
    pred = model.predict(df_test.iloc[rows, feat_idx])
    return pred

# main program
out = RFtest(models, df_test, pt_test, features_idx, no_trees)
Edit: another observation: when chunking the test data, the program runs smoothly with much reduced memory usage. This is what I ended up using to make the program run. Code snippet for the updated predTree function:
def predTree(model, feat_idx):
    # get all indices of samples that meet the feature subset requirement
    nan_rows = np.unique(np.where(df_test.iloc[:, feat_idx] == settings.nan_enc)[0])
    all_rows = np.arange(df_test.shape[0])
    rows = all_rows[np.invert(np.in1d(all_rows, nan_rows))]  # discard rows with missing values in the given features
    # predict height per valid sample, in chunks of 500 rows
    chunksize = 500
    n_chunks = int(math.ceil(float(rows.shape[0]) / chunksize))
    pred = []
    for i in range(n_chunks):
        if i == n_chunks - 1:
            # last chunk takes all remaining rows
            pred_chunked = model.predict(df_test.iloc[rows[i*chunksize:], feat_idx])
        else:
            pred_chunked = model.predict(df_test.iloc[rows[i*chunksize:(i+1)*chunksize], feat_idx])
        pred.append(pred_chunked)
    pred = np.concatenate(pred)
    # populate output vector, leaving NaN for the discarded rows
    predicted = np.empty(df_test.shape[0])
    predicted.fill(np.nan)
    predicted[rows] = pred
    return predicted
Upvotes: 9
Views: 2296
Reputation: 16758
I am not sure whether the memory issue is related to the use of itertools.izip in args = itertools.izip(models, features_idx), which may trigger creation of copies of the iterator, along with its arguments, across all worker processes. Have you tried just using zip?
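A minimal version of the suggested change; in Python 2, zip returns a plain list, so the (model, feat_idx) pairs are fully materialized before being handed to Pool.map:

args = zip(models, features_idx)  # list of (model, feat_idx) tuples instead of a lazy iterator
out_list = p.map(star_testTree, args)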
Another hypothesis is inefficient garbage collection that is not triggered when you need it. I would check whether running gc.collect() just before model.predict in predTree helps.
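A sketch of that check, applied to the predTree from the question:

import gc

def predTree(model, feat_idx):
    nan_rows = np.unique(np.where(df_test.iloc[:, feat_idx] == settings.nan_enc)[0])
    all_rows = np.arange(df_test.shape[0])
    rows = all_rows[np.invert(np.in1d(all_rows, nan_rows))]
    gc.collect()  # force a collection right before the memory-heavy call
    pred = model.predict(df_test.iloc[rows, feat_idx])
    return pred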
There is also a third potential reason (probably the most credible one). Let me cite the Python FAQ on How does Python manage memory?:
"In current releases of CPython, each new assignment to x inside the loop will release the previously allocated resource."
In your chunked function you do precisely that: you repeatedly reassign pred_chunked.
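A toy illustration of that FAQ behaviour; compute_chunk is a stand-in for any large allocation such as model.predict:

def compute_chunk(i):
    return [0] * 10**6  # stand-in for a large intermediate result

result = None
for i in range(5):
    # rebinding 'result' drops the reference to the previous list,
    # so CPython's reference counting can free it immediately
    result = compute_chunk(i)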
Upvotes: 1