user3078500

Reputation: 302

H2O Using a large dataset size

What is the maximum dataset size that I am allowed to use in H2O?

Specifically, can the dataset size be larger than the RAM / disk space on each node?

I have nodes with around 25 GB of disk space and 40 GB of RAM, and I want to use a dataset that is around 70 GB.

Thank you

I am getting errors like:

Exception in thread "qtp1392425346-39505" java.lang.OutOfMemoryError: GC overhead limit exceeded

Upvotes: 0

Views: 1159

Answers (1)

Erin LeDell

Reputation: 8819

There is no maximum dataset size in H2O. The requirements are defined by how big a cluster you create. There is more info here about how to tell H2O what max heap size you'd like.
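A minimal sketch of setting the heap size when launching a single H2O node from the command line, assuming `h2o.jar` is in the current directory (the `30g` value is just an illustration; leave some headroom below the node's physical RAM):

```shell
# Launch H2O with a 30 GB max JVM heap on a 40 GB machine.
# -Xmx is the standard JVM flag that caps the heap H2O can use.
java -Xmx30g -jar h2o.jar
```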

If your dataset is 70G and you have nodes with only 40G RAM, then you will have to use a multi-node cluster. The general rule of thumb that we tell people is that the total memory of your H2O cluster should be about 3x the size of your data on disk. It's highly dependent on which algorithm you are using, however.

70G * 3 = 210G, so you might want to try a 5-node cluster. Or, you could start with fewer nodes, run your code, and increase the size of the cluster as required.
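One way to form a multi-node cluster is to run the same launch command on each machine and point the nodes at each other with a flatfile. A sketch, assuming `h2o.jar` is on every node and the IP addresses below are placeholders for your own machines:

```shell
# flatfile.txt lists every cluster member as ip:port, one per line,
# e.g. (hypothetical addresses):
#   192.168.1.101:54321
#   192.168.1.102:54321
#   192.168.1.103:54321

# Run this on EACH node; they discover each other via the flatfile
# and join into one cluster whose total heap is the sum of the -Xmx
# values across nodes.
java -Xmx35g -jar h2o.jar -flatfile flatfile.txt -port 54321
```

With five such 35G nodes you get roughly 175G of total heap; add nodes (or RAM per node) until the cluster comfortably covers the ~3x-data-size guideline.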

Upvotes: 2
