Reputation: 958
I have a dataset with 900,000 rows and 8 columns. Six of the columns are integers and the other two are floats. When I fit about a quarter of the dataset (200,000 rows), the code runs fine and takes under 30 seconds. When I try to fit 400,000 rows or more, my computer freezes completely because the python.exe process takes up over 5 GB of RAM.
The first thing I tried was setting the warm_start parameter to True and then going through the data 50,000 rows at a time:
import sklearn.ensemble as sk

n = 0
i = 50000
clf = sk.RandomForestClassifier(oob_score=True, n_jobs=-1, n_estimators=0, warm_start=True)
while i <= 850000:
    clf.n_estimators += 10  # warm_start only adds trees when n_estimators grows between fits
    clf.fit(X.iloc[n:i], Y.iloc[n:i])
    n += 50000
    i += 50000
This didn't solve anything; I ran into the same issue.
The next thing I tried was checking whether some part of the data took much more memory to process than the rest. I recorded the memory increase of the python.exe process and the time each fit took, if it completed at all.
n = 50
clf = sk.RandomForestClassifier(oob_score=True, n_jobs=-1, n_estimators=n, warm_start=True)
Z = X[['DayOfWeek', 'PdDistrict', 'Year', 'Day', 'Month']]  # takes 15 s and ~600 MB additional RAM (800 MB total)
Z = X[['X', 'Address', 'Y']]  # takes 24.8 s and ~1.1 GB additional RAM (1389 MB total)
Z = X  # never finishes, peaks at 5.2 GB
%time clf.fit(Z.iloc[0:400000], Y.iloc[0:400000])
While some subsets take longer to process than others, none of them accounts for 5 GB of memory. The data itself is only a few megabytes in size, so I don't see how fitting it can take up so much memory.
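Here is a minimal sketch of the kind of measurement I mean, assuming psutil is installed (the fit_and_report helper and the column subset are just for illustration):

import os
import psutil
from sklearn.ensemble import RandomForestClassifier

proc = psutil.Process(os.getpid())

def fit_and_report(Z, Y, label):
    # Record the resident set size of this process before and after the fit.
    before = proc.memory_info().rss
    clf = RandomForestClassifier(oob_score=True, n_jobs=-1, n_estimators=50)
    clf.fit(Z, Y)
    after = proc.memory_info().rss
    print('%s: +%.0f MB' % (label, (after - before) / 1e6))

fit_and_report(X[['DayOfWeek', 'PdDistrict', 'Year', 'Day', 'Month']].iloc[:400000],
               Y.iloc[:400000], 'date columns')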
Upvotes: 3
Views: 12701
Reputation: 5859
I ran into a similar situation with a too-large Random Forest model. The problem was that the trees were too deep and took a lot of memory. To deal with it, I set max_depth = 6 and it reduced the memory. I even wrote about it in a blog post. In the article, I was using a 32k-row dataset with 15 columns. Setting max_depth=6 decreased memory consumption 66 times while keeping similar performance (in the article, the performance even increased).
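Something like this minimal sketch shows the effect; synthetic data from make_classification stands in for the dataset from the article, so the exact numbers will differ:

import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 32k rows, 15 columns, as in the article.
X, y = make_classification(n_samples=32000, n_features=15, random_state=0)

for depth in (None, 6):
    clf = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=0)
    clf.fit(X, y)
    # The pickled size is a rough proxy for the model's memory footprint.
    print('max_depth=%s -> %.1f MB' % (depth, len(pickle.dumps(clf)) / 1e6))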
Upvotes: 3
Reputation: 28788
The model you are building just gets too big. Get more RAM or build a smaller model. To build a smaller model, either create fewer trees or limit the depth of each tree, for example with max_depth. Try max_depth=5 and see what happens (a rough sketch is below). Also, how many classes do you have? More classes make everything more expensive.
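A rough sketch of what I mean, with X and Y as in your question (the exact values are just starting points):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# More classes make each tree bigger, so check the class count first.
print('number of classes:', np.unique(Y).size)

# A smaller model: fewer, shallower trees.
clf = RandomForestClassifier(n_estimators=30, max_depth=5, n_jobs=-1)
clf.fit(X, Y)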
Also, you might want to try this: https://github.com/scikit-learn/scikit-learn/pull/4783
Upvotes: 3