Reputation: 103
I am trying to train a machine learning model with H2O (3.14). My dataset is 4 GB, and my computer has 2 GB of RAM with 2 GB of swap, JDK 1.8. According to this article, H2O can process a huge dataset with 2 GB of RAM:
- A note on Bigger Data and GC: We do a user-mode swap-to-disk when the Java heap gets too full, i.e., you're using more Big Data than physical DRAM. We won't die with a GC death-spiral, but we will degrade to out-of-core speeds. We'll go as fast as the disk will allow. I've personally tested loading a 12Gb dataset into a 2Gb (32bit) JVM; it took about 5 minutes to load the data, and another 5 minutes to run a Logistic Regression.
Some questions around this issue: Is there an option such as --cleaner in H2O? I configured the Java heap with java -Xmx10g -jar h2o.jar. When I load the dataset, the H2O cluster information is as follows:
However, the JVM consumed all of the RAM and swap, and the operating system then killed the Java H2O process.
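For reference, the same heap cap can also be set from the Python client when it starts a local H2O node; a minimal sketch (the 1G value is only an illustration, not the setting I actually used):

import h2o

# max_mem_size caps the JVM heap of the H2O node that h2o.init() launches,
# much like -Xmx on the java command line above.
h2o.init(max_mem_size="1G")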
I also installed H2O on Spark (Sparkling Water). I can load the dataset, but Spark hangs with swap memory full and the following logs:
+ FREE:426.8 MB == MEM_MAX:2.67 GB), desiredKV=841.3 MB OOM!
09-01 02:01:12.377 192.168.233.133:54321 6965 Thread-47 WARN: Swapping! OOM, (K/V:1.75 GB + POJO:513.2 MB + FREE:426.8 MB == MEM_MAX:2.67 GB), desiredKV=841.3 MB OOM!
09-01 02:01:12.377 192.168.233.133:54321 6965 Thread-48 WARN: Swapping! OOM, (K/V:1.75 GB + POJO:513.2 MB + FREE:426.8 MB == MEM_MAX:2.67 GB), desiredKV=841.3 MB OOM!
09-01 02:01:12.381 192.168.233.133:54321 6965 Thread-45 WARN: Swapping! OOM, (K/V:1.75 GB + POJO:513.3 MB + FREE:426.7 MB == MEM_MAX:2.67 GB), desiredKV=803.2 MB OOM!
09-01 02:01:12.382 192.168.233.133:54321 6965 Thread-46 WARN: Swapping! OOM, (K/V:1.75 GB + POJO:513.4 MB + FREE:426.5 MB == MEM_MAX:2.67 GB), desiredKV=840.9 MB OOM!
09-01 02:01:12.384 192.168.233.133:54321 6965 #e Thread WARN: Swapping! GC CALLBACK, (K/V:1.75 GB + POJO:513.4 MB + FREE:426.5 MB == MEM_MAX:2.67 GB), desiredKV=802.7 MB OOM!
09-01 02:01:12.867 192.168.233.133:54321 6965 FJ-3-1 WARN: Swapping! OOM, (K/V:1.75 GB + POJO:513.4 MB + FREE:426.5 MB == MEM_MAX:2.67 GB), desiredKV=1.03 GB OOM!
09-01 02:01:13.376 192.168.233.133:54321 6965 Thread-46 WARN: Swapping! OOM, (K/V:1.75 GB + POJO:513.2 MB + FREE:426.8 MB == MEM_MAX:2.67 GB), desiredKV=803.2 MB OOM!
09-01 02:01:13.934 192.168.233.133:54321 6965 Thread-45 WARN: Swapping! OOM, (K/V:1.75 GB + POJO:513.2 MB + FREE:426.8 MB == MEM_MAX:2.67 GB), desiredKV=841.3 MB OOM!
09-01 02:01:12.867 192.168.233.133:54321 6965 #e Thread WARN: Swapping! GC CALLBACK, (K/V:1.75 GB + POJO:513.2 MB + FREE:426.8 MB == MEM_MAX:2.67 GB), desiredKV=803.2 MB OOM!
In this case, I think the garbage collector is waiting to clean some unused memory in swap.
How can I process a huge dataset with limited RAM?
Upvotes: 3
Views: 1970
Reputation: 3671
The cited article from 2014 is many years out of date and refers to H2O-2. The within-H2O user-mode swap-to-disk concept was experimental at that time.
It has never been supported in H2O-3 (which became the main H2O code base around early 2015) because the performance was poor, as the cited StackOverflow post explains.
Upvotes: 2
Reputation: 28913
If this is in any way commercial, buy more RAM, or pay a few dollars to rent a few hours on a cloud server: the extra time and effort of doing machine learning on a machine that is too small is simply not worth it.
If it is a learning project, with no budget at all: cut the data set into 8 equal-sized parts (*), and just use the first part to make and tune your models. (If the data is not randomly ordered, cut it into 32 equal parts, and then concatenate parts 1, 9, 17 and 25, or something like that.)
If you really, really, really must build a model using the whole data set, then still do the above. But then save the model, and move on to the 2nd of your 8 data sets. You will already have tuned hyperparameters by this point, so you are just generating a model, and it will be quick. Repeat for parts 3 to 8. Now you have 8 models, and can use them in an ensemble (a rough sketch follows the footnote below).
*: I chose 8, which gives you a 0.5GB data set, which is a quarter of available memory. For the early experiments I'd actually recommend going even smaller, e.g. 50MB, as it will make the iterations so much quicker.
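A rough sketch of that split-train-ensemble loop with the h2o Python client, assuming a regression problem; the file names (part_0.csv ... part_7.csv, test.csv), the response column "target", and the GBM hyperparameters are placeholders, not values taken from the question:

import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init(max_mem_size="1G")                      # keep the H2O heap well below physical RAM

models = []
for i in range(8):
    part = h2o.import_file("part_%d.csv" % i)    # one ~0.5 GB slice at a time
    x = [c for c in part.columns if c != "target"]
    # Hyperparameters assumed to have been tuned on part 0 beforehand
    gbm = H2OGradientBoostingEstimator(ntrees=50, max_depth=5)
    gbm.train(x=x, y="target", training_frame=part)
    models.append(gbm)
    h2o.remove(part)                             # drop the slice before importing the next one

# Simple ensemble: average the 8 models' predictions on new data.
# (For classification you would average the predicted class probabilities instead.)
test = h2o.import_file("test.csv")
preds = [m.predict(test)["predict"] for m in models]
avg_pred = preds[0]
for p in preds[1:]:
    avg_pred = avg_pred + p
avg_pred = avg_pred / len(models)

The point is that only one ~0.5 GB slice is resident in H2O at any time, so the heap never needs to hold the full 4 GB.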
A couple more thoughts:
Upvotes: 1