SarahW

Reputation: 21

Random forest uses too much memory

I am trying to use a random forest (with scikit-learn 0.18.1 installed with Anaconda; Python 3) for my research. My dataset contains around 325000 samples, each composed of 11 features (all non-zero values). I create a random forest with the following call (max_depth has been set to 10 in order to limit the amount of memory used):

 cfl = sk_ensemble.RandomForestClassifier(n_estimators=100, n_jobs=10, verbose=500, max_depth=10)

Unfortunately, building the random forest needs an enormous amount of memory: I have 128 GB at my disposal, and top shows 100% of it in use. Python then raises a MemoryError.

I create my matrices the following way:

np.array(Xl, dtype=np.float32)

How is it possible that I need more than 128 GB of RAM for this rather light task? (Even with n_jobs=1 I still run out of memory, though in that case it happens during prediction, which uses more than the available memory.)
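For scale, here is a quick check of how much memory the input itself should need, using only the figures given above (325000 samples, 11 features, float32). It ignores any temporary copies that NumPy or scikit-learn may make, so treat it as a rough lower bound rather than a measurement.

```python
# Rough size of the dense float32 feature matrix described in the question.
n_samples = 325000    # from the question
n_features = 11       # from the question
bytes_per_value = 4   # np.float32 uses 4 bytes per element

matrix_bytes = n_samples * n_features * bytes_per_value
print(f"feature matrix: {matrix_bytes / 1e6:.1f} MB")  # feature matrix: 14.3 MB
```

So the data is about four orders of magnitude smaller than the 128 GB that ends up being used.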

For debugging, I launched this command:

dmesg | grep -E -i -B30 'killed process'

and it yielded the following:

[1947333.164124] [ 1193]    81  1193     6137       52      16       81          -900 dbus-daemon
[1947333.164126] [ 1212]     0  1212    81781       73      80     5004             0 firewalld
[1947333.164127] [ 1213]     0  1213    31556       22      17      133             0 crond
[1947333.164129] [ 1215]     0  1215     6461        0      17       62             0 atd
[1947333.164131] [ 1228]     0  1228    27509        1      10       31             0 agetty
[1947333.164133] [ 1230]     0  1230   108909       60      65      487             0 NetworkManager
[1947333.164134] [ 1569]     0  1569    93416      153      91      181             0 rsyslogd
[1947333.164136] [ 1576]     0  1576   138290       63      87     2613             0 tuned
[1947333.164138] [ 1577]     0  1577    28335        1      11       37             0 rhsmcertd
[1947333.164140] [ 1582]     0  1582    20617       15      41      201         -1000 sshd
[1947333.164142] [ 1589]     0  1589    26978        8       7       28             0 rhnsd
[1947333.164143] [ 2221]     0  2221    22244        0      42      256             0 master
[1947333.164146] [ 2267]    89  2267    22287        0      42      251             0 qmgr
[1947333.164149] [19994]     0 19994    36365        2      73      326             0 sshd
[1947333.164151] [19996]  1002 19996    36365        0      68      329             0 sshd
[1947333.164153] [19997]  1002 19997    13175        0      29      142             0 sftp-server
[1947333.164155] [20826]     0 20826    36365       98      72      233             0 sshd
[1947333.164156] [20828]  1002 20828    36400      114      69      220             0 sshd
[1947333.164158] [20829]  1002 20829    28872       46      13       68             0 bash
[1947333.164160] [20862]     0 20862    36365        6      73      324             0 sshd
[1947333.164161] [20877]  1002 20877    36400       38      70      295             0 sshd
[1947333.164163] [20878]  1002 20878    28846        0      13      110             0 bash
[1947333.164164] [20899]  1002 20899    39521      198      30       72             0 top
[1947333.164166] [20929]     0 20929    36379      116      74      215             0 sshd
[1947333.164168] [20931]  1002 20931    36417      118      71      213             0 sshd
[1947333.164169] [20932]  1002 20932    28874       34      14       81             0 bash
[1947333.164171] [20972]  1002 20972    37348      229      27      468             0 vim
[1947333.164172] [20996]    89 20996    22270       83      44      164             0 pickup
[1947333.164174] [21075]  1002 21075 67384359 31985935   71321  4435535             0 python3
[1947333.164176] Out of memory: Kill process 21075 (python3) score 974 or sacrifice child
[1947333.164190] Killed process 21075 (python3) total-vm:269537436kB, anon-rss:127943740kB, file-rss:0kB, shmem-rss:0kB

Upvotes: 1

Views: 434

Answers (1)

SarahW

Reputation: 21

Ok, I found the solution to this issue.

I was working on a regression problem (and not a classification problem).
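A likely reason this matters, sketched under assumptions that are not stated in the post: a scikit-learn classifier treats every distinct target value as its own class, and each fitted tree stores one float64 per class at every node. If the regression targets were nearly all distinct, the number of classes approaches the number of samples. The helper `forest_value_bytes` below is hypothetical, the class count of 325000 is an assumed worst case, and the node count is the upper bound for a complete tree of depth 10.

```python
def forest_value_bytes(n_classes, n_trees=100, max_depth=10, bytes_per_value=8):
    """Upper bound on the memory used by per-node value arrays in a forest.

    Assumes every tree is a complete binary tree of the given depth and
    stores one float64 per class at each node.
    """
    max_nodes = 2 ** (max_depth + 1) - 1  # 2047 nodes for max_depth=10
    return n_trees * max_nodes * n_classes * bytes_per_value

# Assumption: almost every target value is unique, so n_classes ~ n_samples.
as_classifier = forest_value_bytes(n_classes=325000)
# A regressor stores a single value per node.
as_regressor = forest_value_bytes(n_classes=1)
# Class probabilities for 325000 samples form an (n_samples, n_classes) float64 array.
proba_bytes = 325000 * 325000 * 8

print(f"classifier trees: {as_classifier / 1e9:.0f} GB")  # classifier trees: 532 GB
print(f"regressor trees:  {as_regressor / 1e6:.1f} MB")   # regressor trees:  1.6 MB
print(f"probabilities:    {proba_bytes / 1e9:.0f} GB")    # probabilities:    845 GB
```

Under these assumptions both fitting and prediction can exceed 128 GB, which would be consistent with the MemoryError during training and with the prediction failure seen with n_jobs=1.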

Using RandomForestRegressor instead of RandomForestClassifier solved my memory problems.
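The difference can be seen on a small synthetic example. This is a minimal sketch, not my original script: the data is random, the sizes are deliberately tiny, and the targets are 500 distinct integers standing in for many-valued regression targets.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(500, 11).astype(np.float32)  # synthetic stand-in for the real features
y = rng.permutation(500)                  # 500 distinct integer targets (assumed)

clf = RandomForestClassifier(n_estimators=3, max_depth=5, random_state=0).fit(X, y)
reg = RandomForestRegressor(n_estimators=3, max_depth=5, random_state=0).fit(X, y)

# tree_.value has shape (node_count, n_outputs, max_n_classes).
clf_shape = clf.estimators_[0].tree_.value.shape
reg_shape = reg.estimators_[0].tree_.value.shape

print("classes seen by the classifier:", clf.n_classes_)
print("classifier per-node values:", clf_shape)
print("regressor per-node values: ", reg_shape)
```

The last dimension of the classifier's array grows with the number of distinct target values, while the regressor keeps a single value per node regardless of how many different targets appear.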

Upvotes: 1
