lotusirous

Reputation: 103

How to deal with a big dataset with H2O

I am trying to train a machine learning model with H2O (3.14). My dataset is 4 GB and my computer has 2 GB of RAM plus 2 GB of swap, with JDK 1.8. According to this article, H2O can process a huge dataset with only 2 GB of RAM:

  • A note on Bigger Data and GC: We do a user-mode swap-to-disk when the Java heap gets too full, i.e., you're using more Big Data than physical DRAM. We won't die with a GC death-spiral, but we will degrade to out-of-core speeds. We'll go as fast as the disk will allow. I've personally tested loading a 12Gb dataset into a 2Gb (32bit) JVM; it took about 5 minutes to load the data, and another 5 minutes to run a Logistic Regression.

Here is what I have tried to work around this issue:

Workaround 1:

I configured the Java heap with java -Xmx10g -jar h2o.jar. When I load the dataset, H2O reports the following cluster information:

However, the JVM consumed all of the RAM and swap, and the operating system then killed the Java H2O process.

Workaround 2:

I installed Sparkling Water (H2O on Spark). I can load the dataset, but Spark hangs with the following logs and the swap memory completely full:

 + FREE:426.8 MB == MEM_MAX:2.67 GB), desiredKV=841.3 MB OOM!
09-01 02:01:12.377 192.168.233.133:54321 6965   Thread-47 WARN: Swapping!  OOM, (K/V:1.75 GB + POJO:513.2 MB + FREE:426.8 MB == MEM_MAX:2.67 GB), desiredKV=841.3 MB OOM!
09-01 02:01:12.377 192.168.233.133:54321 6965   Thread-48 WARN: Swapping!  OOM, (K/V:1.75 GB + POJO:513.2 MB + FREE:426.8 MB == MEM_MAX:2.67 GB), desiredKV=841.3 MB OOM!
09-01 02:01:12.381 192.168.233.133:54321 6965   Thread-45 WARN: Swapping!  OOM, (K/V:1.75 GB + POJO:513.3 MB + FREE:426.7 MB == MEM_MAX:2.67 GB), desiredKV=803.2 MB OOM!
09-01 02:01:12.382 192.168.233.133:54321 6965   Thread-46 WARN: Swapping!  OOM, (K/V:1.75 GB + POJO:513.4 MB + FREE:426.5 MB == MEM_MAX:2.67 GB), desiredKV=840.9 MB OOM!
09-01 02:01:12.384 192.168.233.133:54321 6965   #e Thread WARN: Swapping!  GC CALLBACK, (K/V:1.75 GB + POJO:513.4 MB + FREE:426.5 MB == MEM_MAX:2.67 GB), desiredKV=802.7 MB OOM!
09-01 02:01:12.867 192.168.233.133:54321 6965   FJ-3-1    WARN: Swapping!  OOM, (K/V:1.75 GB + POJO:513.4 MB + FREE:426.5 MB == MEM_MAX:2.67 GB), desiredKV=1.03 GB OOM!
09-01 02:01:13.376 192.168.233.133:54321 6965   Thread-46 WARN: Swapping!  OOM, (K/V:1.75 GB + POJO:513.2 MB + FREE:426.8 MB == MEM_MAX:2.67 GB), desiredKV=803.2 MB OOM!
09-01 02:01:13.934 192.168.233.133:54321 6965   Thread-45 WARN: Swapping!  OOM, (K/V:1.75 GB + POJO:513.2 MB + FREE:426.8 MB == MEM_MAX:2.67 GB), desiredKV=841.3 MB OOM!
09-01 02:01:12.867 192.168.233.133:54321 6965   #e Thread WARN: Swapping!  GC CALLBACK, (K/V:1.75 GB + POJO:513.2 MB + FREE:426.8 MB == MEM_MAX:2.67 GB), desiredKV=803.2 MB OOM!

In this case, I think the garbage collector is stuck trying to reclaim unused memory that has been swapped out.

How can I process a huge dataset with limited RAM?

Upvotes: 3

Views: 1970

Answers (2)

TomKraljevic

Reputation: 3671

The cited article from 2014 is many years out of date, and refers to H2O-2. The within-H2O user-mode swap-to-disk concept was experimental at that time.

But swap-to-disk has never been supported in H2O-3 (which became the main H2O code base around early 2015) because the performance was bad, as the cited post explains.

Upvotes: 2

Darren Cook

Reputation: 28913

If this is in any way commercial, buy more RAM, or pay a few dollars to rent a few hours on a cloud server.

This is because the extra time and effort to do machine learning on a machine that is too small is just not worth it.

If it is a learning project, with no budget at all: cut the data set into 8 equal-sized parts (*), and just use the first part to make and tune your models. (If the data is not randomly ordered, cut it in 32 equal parts, and then concatenate parts 1, 9, 17 and 25; or something like that.)

If you really, really, really, must build a model using the whole data set, then still do the above. But then save the model, then move to the 2nd of your 8 data sets. You will already have tuned hyperparameters by this point, so you are just generating a model, and it will be quick. Repeat for parts 3 to 8. Now you have 8 models, and can use them in an ensemble.

*: I chose 8, which gives you a 0.5GB data set, which is a quarter of available memory. For the early experiments I'd actually recommend going even smaller, e.g. 50MB, as it will make the iterations so much quicker.
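To make the split-and-train idea concrete, here is a minimal Python sketch. It assumes the data is a single CSV with a target column literally named "target", that a GBM is the model being built, and that the hyperparameters were already tuned on the first part; the file names, column name and parameter values are placeholders, not a recipe.

    # Minimal sketch (placeholders throughout): split the CSV on disk into
    # 8 parts, then train one model per part in a small-memory H2O instance.
    import csv

    def split_csv(path, n_parts=8):
        # Distribute rows round-robin so each part is a rough random sample
        # even if the original file happens to be ordered.
        with open(path, newline="") as src:
            reader = csv.reader(src)
            header = next(reader)
            outs = [open("part_%d.csv" % i, "w", newline="") for i in range(n_parts)]
            writers = [csv.writer(o) for o in outs]
            for w in writers:
                w.writerow(header)
            for i, row in enumerate(reader):
                writers[i % n_parts].writerow(row)
            for o in outs:
                o.close()

    split_csv("dataset.csv")          # the 4 GB file becomes eight ~0.5 GB files

    import h2o
    from h2o.estimators.gbm import H2OGradientBoostingEstimator

    h2o.init(max_mem_size="1G")       # keep the JVM well under the 2 GB of RAM

    models = []
    for i in range(8):
        part = h2o.import_file("part_%d.csv" % i)
        model = H2OGradientBoostingEstimator(ntrees=50, max_depth=5)  # values tuned on part_0
        model.train(y="target", training_frame=part)
        models.append(model)
        h2o.remove(part)              # free the K/V store before loading the next part

    # Ensemble: average (or vote) the 8 models' predictions on new data, e.g.
    #   preds = [m.predict(new_frame) for m in models]

The split_csv call is the only pass over the full 4 GB file, and it streams row by row, so it never needs the whole file in memory.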

A couple more thoughts:

  • H2O compresses data in-memory. So if 4 GB was the uncompressed data size, you might get by with less memory. (However, remember that the recommendation is for memory that is 3-4x the size of your data.)
  • If you have some friends with similar small-memory computers, you could network them together into a single H2O cluster. 4 to 8 computers might be enough to load your data. It might work well, or it might be horribly slow; it depends on the algorithm (and on how fast your network is). A rough sketch of this is below.
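A rough sketch of that clustering idea, assuming each machine can run an H2O node and they can all reach each other over the network; the IP addresses, cloud name and paths below are purely illustrative:

    # On every machine, start an H2O node that joins the same cloud, e.g.:
    #   java -Xmx1g -jar h2o.jar -name friends-cloud -flatfile flatfile.txt -port 54321
    # where flatfile.txt lists each node as ip:port, one per line:
    #   192.168.233.133:54321
    #   192.168.233.134:54321
    import h2o

    # Point the Python client at any one node of the already-formed cloud;
    # the imported frame is then distributed across the memory of all nodes.
    h2o.connect(ip="192.168.233.133", port=54321)

    # The path must be readable by the H2O nodes themselves (shared folder,
    # NFS, HDFS, ...), not just by the machine running this script.
    data = h2o.import_file("dataset.csv")
    print(data.dim)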

Upvotes: 1
