Reputation: 302
What is the maximum dataset size that I am allowed to use in H2O?
Specifically, can the dataset be larger than the RAM / disk space on each node?
I have nodes with around 25 GB of disk space and 40 GB of RAM, and I want to use a dataset that is around 70 GB.
Thank you
I am getting errors like:
Exception in thread "qtp1392425346-39505" java.lang.OutOfMemoryError: GC overhead limit exceeded
Upvotes: 0
Views: 1159
Reputation: 8819
There is no maximum dataset size in H2O; the practical limit is defined by how big a cluster you create. There is more info here about how to tell H2O the maximum heap size you'd like.
If your dataset is 70 GB and your nodes have only 40 GB of RAM, then you will have to use a multi-node cluster. The general rule of thumb we tell people is that your H2O cluster should have about 3x as much memory as your data takes up on disk. However, this is highly dependent on which algorithm you are using.
70 GB × 3 = 210 GB, so you might want to try a 5-node cluster. Or you could start with fewer nodes, run your code, and increase the size of the cluster as required.
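The rule of thumb above can be sketched as a quick calculation. This is just an illustration of the heuristic, not an H2O API: the function name `recommended_nodes` is mine, and the 3x multiplier is the rule of thumb from this answer.

```python
import math

def recommended_nodes(data_gb: float, node_ram_gb: float, multiplier: float = 3.0) -> int:
    """Estimate how many nodes are needed so total cluster RAM is ~multiplier x data size.

    The per-node heap itself is set when you launch H2O, e.g. via the JVM
    -Xmx flag (java -Xmx40g -jar h2o.jar) or h2o.init(max_mem_size="40G").
    """
    target_ram_gb = data_gb * multiplier            # e.g. 70 GB * 3 = 210 GB
    return math.ceil(target_ram_gb / node_ram_gb)   # nodes of node_ram_gb each

print(recommended_nodes(70, 40))  # ceil(210 / 40) -> 6
```

Rounding up gives 6 nodes; the 5-node suggestion above (5 × 40 GB = 200 GB) is close to the 210 GB target, so starting there and adding nodes if you still hit memory pressure is reasonable.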
Upvotes: 2