samst

Reputation: 556

h2o sparkling water save frame to disk

I am trying to import a frame by creating an H2O frame from a Spark parquet file. The file is 2 GB, has about 12M rows, and holds sparse vectors with 12k columns. It is not that big in parquet format, but the import takes forever. In H2O it is actually reported as 447 MB compressed size, which is quite small.
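Roughly, the conversion I am doing looks like this (just a sketch; hc is an existing PySparkling H2OContext, spark is the SparkSession, and the path is a placeholder; the method is called asH2OFrame in recent Sparkling Water versions and as_h2o_frame in older ones):

df = spark.read.parquet("/data/features.parquet")  # placeholder path, ~12M rows x 12k sparse cols
h2o_frame = hc.asH2OFrame(df)                      # this conversion is the slow step (~39 min)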

Am I doing it wrong? And once the import actually finishes (it took 39 min), is there any way in H2O to save the frame to disk for fast loading next time?

I understand H2O does some magic behind the scenes that takes this long, but the only thing I found is a download-as-CSV option, which is slow and huge for an 11k x 1M sparse dataset, and I doubt it would be any faster to import.

I feel like a piece is missing. Any info about H2O data import/export is appreciated. Model save/load works great, but loading train/val/test data seems an unreasonably slow procedure.

I have 10 Spark workers with 10 GB each and gave the driver 8 GB. This should be plenty.

Upvotes: 1

Views: 1255

Answers (2)

Michal Kurka

Reputation: 566

I suggest exporting the dataframe from Spark into the SVMLight file format (see MLUtils.saveAsLibSVMFile(...)). H2O can then ingest this format natively.
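A rough sketch of that route in PySpark (the DataFrame df, its "label" and "features" column names, and the output path are placeholders; Vectors.fromML converts the ml.linalg vectors a DataFrame usually holds into the mllib type LabeledPoint expects):

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils

# Build an RDD of LabeledPoint from the label and sparse feature vector columns.
labeled = df.rdd.map(lambda r: LabeledPoint(r["label"], Vectors.fromML(r["features"])))

# Write the data out as SVMLight (libSVM) text part files.
MLUtils.saveAsLibSVMFile(labeled, "hdfs:///tmp/features_svmlight")

The resulting directory of part files can then be imported into H2O directly, since H2O parses SVMLight natively.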

As Darren pointed out, you can export data from H2O in multiple parts, which speeds up the export. However, H2O currently only supports exporting to CSV files, which is sub-optimal for your use case of very sparse data. This functionality is accessible via the Java API:

water.fvec.Frame.export(yourFrame, "/target/directory", yourFrame.key.toString, true, -1 /* automatically determine number of part files */)

Upvotes: 0

Darren Cook

Reputation: 28928

Use h2o.exportFile() (h2o.export_file() in Python), with the parts argument set to -1. The -1 effectively means that each machine in the cluster will export just its own data. In your case you'd end up with 10 files, and it should be 10 times quicker than otherwise.
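For example, in Python (the frame variable and the target path are placeholders):

import h2o

# parts=-1 makes each node in the cluster write its own part file.
h2o.export_file(frame, "hdfs:///exports/my.dat", parts=-1)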

To read them back in, use h2o.importFile() and specify all 10 parts when loading:

frame <- h2o.importFile(c(
  "s3n://mybucket/my.dat.1",
  "s3n://mybucket/my.dat.2",
  ...
  ) )

When you give it an array of files, they are loaded and parsed in parallel.
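The Python equivalent takes a list of paths (placeholders here, matching the export example above):

import h2o

# All part files are handed to H2O at once, so they parse in parallel.
frame = h2o.import_file(path=["s3n://mybucket/my.dat.1",
                              "s3n://mybucket/my.dat.2"])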

For a cluster on a local LAN, HDFS is the recommended place to keep these files. I've had reasonable results keeping the files on S3 when running a cluster on EC2.

Upvotes: 1
