Reputation: 21
I am trying to use a random forest (scikit-learn 0.18.1, installed via Anaconda; Python 3) for my research. My dataset contains around 325,000 samples, each composed of 11 features (all non-zero values).
I create the random forest with the following call (max_depth is set to 10 to limit the amount of memory used):
import sklearn.ensemble as sk_ensemble
cfl = sk_ensemble.RandomForestClassifier(n_estimators=100, n_jobs=10, verbose=500, max_depth=10)
Unfortunately, building the random forest needs an enormous amount of memory: I have 128 GB at my disposal, and 100% of it is used (as reported by top
). Python then raises a MemoryError.
I create my matrices as follows:
np.array(Xl, dtype=np.float32)
How can this rather light task need more than 128 GB of RAM? (Even with n_jobs=1
, I still have memory problems, although then during prediction, which also uses more than the available memory...)
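For scale, the feature matrix itself should be tiny: a quick back-of-the-envelope check (using the sample and feature counts from the question) shows it occupies only about 14 MB, nowhere near 128 GB:

```python
import numpy as np

# Footprint of the feature matrix alone: 325,000 samples x 11 features,
# stored as float32 (4 bytes per value).
n_samples, n_features = 325_000, 11
X = np.zeros((n_samples, n_features), dtype=np.float32)
print(X.nbytes / 1024**2)  # ~13.6 MiB
```

So the memory blow-up must come from the forest itself, not from the input data.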
For debugging, I ran this command:
dmesg | grep -E -i -B30 'killed process'
and it yielded the following:
[1947333.164124] [ 1193] 81 1193 6137 52 16 81 -900 dbus-daemon
[1947333.164126] [ 1212] 0 1212 81781 73 80 5004 0 firewalld
[1947333.164127] [ 1213] 0 1213 31556 22 17 133 0 crond
[1947333.164129] [ 1215] 0 1215 6461 0 17 62 0 atd
[1947333.164131] [ 1228] 0 1228 27509 1 10 31 0 agetty
[1947333.164133] [ 1230] 0 1230 108909 60 65 487 0 NetworkManager
[1947333.164134] [ 1569] 0 1569 93416 153 91 181 0 rsyslogd
[1947333.164136] [ 1576] 0 1576 138290 63 87 2613 0 tuned
[1947333.164138] [ 1577] 0 1577 28335 1 11 37 0 rhsmcertd
[1947333.164140] [ 1582] 0 1582 20617 15 41 201 -1000 sshd
[1947333.164142] [ 1589] 0 1589 26978 8 7 28 0 rhnsd
[1947333.164143] [ 2221] 0 2221 22244 0 42 256 0 master
[1947333.164146] [ 2267] 89 2267 22287 0 42 251 0 qmgr
[1947333.164149] [19994] 0 19994 36365 2 73 326 0 sshd
[1947333.164151] [19996] 1002 19996 36365 0 68 329 0 sshd
[1947333.164153] [19997] 1002 19997 13175 0 29 142 0 sftp-server
[1947333.164155] [20826] 0 20826 36365 98 72 233 0 sshd
[1947333.164156] [20828] 1002 20828 36400 114 69 220 0 sshd
[1947333.164158] [20829] 1002 20829 28872 46 13 68 0 bash
[1947333.164160] [20862] 0 20862 36365 6 73 324 0 sshd
[1947333.164161] [20877] 1002 20877 36400 38 70 295 0 sshd
[1947333.164163] [20878] 1002 20878 28846 0 13 110 0 bash
[1947333.164164] [20899] 1002 20899 39521 198 30 72 0 top
[1947333.164166] [20929] 0 20929 36379 116 74 215 0 sshd
[1947333.164168] [20931] 1002 20931 36417 118 71 213 0 sshd
[1947333.164169] [20932] 1002 20932 28874 34 14 81 0 bash
[1947333.164171] [20972] 1002 20972 37348 229 27 468 0 vim
[1947333.164172] [20996] 89 20996 22270 83 44 164 0 pickup
[1947333.164174] [21075] 1002 21075 67384359 31985935 71321 4435535 0 python3
[1947333.164176] Out of memory: Kill process 21075 (python3) score 974 or sacrifice child
[1947333.164190] Killed process 21075 (python3) total-vm:269537436kB, anon-rss:127943740kB, file-rss:0kB, shmem-rss:0kB
Upvotes: 1
Views: 434
Reputation: 21
OK, I found the solution to this issue.
I was actually working on a regression problem (my target is continuous), not a classification problem, so RandomForestClassifier
was the wrong estimator: presumably it ends up treating every distinct target value as a separate class, which makes the trees enormous. Using RandomForestRegressor
instead of RandomForestClassifier
solved my memory problems.
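A minimal sketch of the fix, on toy random data (the shapes mirror the question, the values and hyperparameters are placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in for the real data: 11 features, continuous target.
rng = np.random.RandomState(0)
X = rng.rand(1000, 11).astype(np.float32)
y = rng.rand(1000).astype(np.float32)  # continuous target -> regression

# A target with (almost) as many distinct values as samples is a strong
# hint that the problem is regression, not classification.
print(np.unique(y).size)

reg = RandomForestRegressor(n_estimators=10, max_depth=10, n_jobs=1)
reg.fit(X, y)
print(reg.predict(X[:3]).shape)  # one continuous prediction per sample
```

Checking `np.unique(y).size` before fitting is a cheap sanity test for whether the labels are really categorical.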
Upvotes: 1