cybermaggedon

Reputation: 31

Sparkling Water won't start on Spark on Google DataProc

I'm trying to use H2O Sparkling Water on Google DataProc. I've successfully run Sparkling Water on standalone Spark, and have now moved on to using it on DataProc. Initially I got an error about spark.dynamicAllocation.enabled not being supported, so I logged on to the master node and started PySpark like this...

pyspark \
   --conf spark.ext.h2o.fail.on.unsupported.spark.param=false \
   --conf spark.dynamicAllocation.enabled=false

The interaction to start Sparkling Water looks like this. Once the stage gets to around 30000 tasks, it starts to grind, and after 30 minutes or so there's a string of errors:

>>> from pysparkling import *
>>> import h2o
>>> hc = H2OContext.getOrCreate(spark)
18/04/11 11:56:08 WARN org.apache.spark.h2o.backends.internal.InternalH2OBackend: Increasing 'spark.locality.wait' to value 30000
18/04/11 11:56:08 WARN org.apache.spark.h2o.backends.internal.InternalH2OBackend: Due to non-deterministic behavior of Spark broadcast-based joins
We recommend to disable them by
configuring `spark.sql.autoBroadcastJoinThreshold` variable to value `-1`:
sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
[Stage 0:=================>                               (35346 + 11) / 100001]
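
For reference, the broadcast-join recommendation in that warning can also be applied at launch time rather than in the session; a minimal sketch, assuming the same --conf style as the command above:

pyspark \
   --conf spark.ext.h2o.fail.on.unsupported.spark.param=false \
   --conf spark.dynamicAllocation.enabled=false \
   --conf spark.sql.autoBroadcastJoinThreshold=-1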

I've tried a variety of things:

- Deploying small (3 nodes).
- Deploying a 30-worker cluster.
- Running DataProc image 1.1 (Spark 2.0), 1.2 (Spark 2.2) and preview (Spark 2.2).

I've also tried a variety of Spark options (see the launch sketch after the list):

spark.ext.h2o.fail.on.unsupported.spark.param=false
spark.ext.h2o.nthreads=2
spark.ext.h2o.cluster.size=2
spark.ext.h2o.default.cluster.size=2
spark.ext.h2o.hadoop.memory=50m
spark.ext.h2o.repl.enabled=false
spark.ext.h2o.flatfile=false
spark.dynamicAllocation.enabled=false
spark.executor.memory=700m
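
For illustration, a sketch of how such options could be passed at launch, assuming the same --conf style as the earlier command (not a verified working combination):

pyspark \
   --conf spark.ext.h2o.fail.on.unsupported.spark.param=false \
   --conf spark.ext.h2o.nthreads=2 \
   --conf spark.ext.h2o.cluster.size=2 \
   --conf spark.dynamicAllocation.enabled=false \
   --conf spark.executor.memory=700m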

Anyone have any luck with H2O on Google DataProc?

Detailed errors are:

18/04/11 12:08:40 WARN org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1523445048432_0005_01_000006 on host: cluster-dev-w-0.c.trust-networks.internal. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1523445048432_0005_01_000006
Exit code: 1
Stack trace: ExitCodeException exitCode=1: 
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
    at org.apache.hadoop.util.Shell.run(Shell.java:869)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Container exited with a non-zero exit code 1

18/04/11 12:08:48 ERROR org.apache.spark.network.server.TransportRequestHandler: Error sending result RpcResponse{requestId=5571077381947066483, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=156]}} to /10.154.0.12:59387; closing connection
java.nio.channels.ClosedChannelException
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)

and later:

Exception in thread "task-result-getter-3" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.lang.Class.newReflectionData(Class.java:2513)
    at java.lang.Class.reflectionData(Class.java:2503)
    at java.lang.Class.privateGetDeclaredConstructors(Class.java:2660)
    at java.lang.Class.getConstructor0(Class.java:3075)
    at java.lang.Class.newInstance(Class.java:412)
    at sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:403)
    at sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:394)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:393)
    at sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:112)

Upvotes: 1

Views: 524

Answers (2)

cybermaggedon

Reputation: 31

OK, I think I solved this myself. Sparkling Water allocates resources based on a number of settings which Google DataProc sets to non-default values.

I edited /etc/spark/conf/spark-defaults.conf, changing spark.dynamicAllocation.enabled to false and spark.ext.h2o.dummy.rdd.mul.factor to 1, which allowed the H2O cluster to start up in about 3 minutes using about a tenth of the resources.

If startup is still too slow for you, try reducing spark.executor.instances from 10000 to 5000 or 1000, although this setting affects the performance of everything else you're running on the Spark cluster.
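
A sketch of the corresponding /etc/spark/conf/spark-defaults.conf entries, assuming the standard Spark properties-file format (the spark.executor.instances line is the optional further reduction mentioned above):

spark.dynamicAllocation.enabled      false
spark.ext.h2o.dummy.rdd.mul.factor   1
spark.executor.instances             1000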

Upvotes: 2

TomKraljevic

Reputation: 3671

You're getting java.lang.OutOfMemoryError. Give more memory.
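
For example, a sketch of one way to raise the limits at launch time, assuming the pressure is in the driver and executor JVMs; the 4g values are placeholders, not figures from the question:

pyspark \
   --conf spark.driver.memory=4g \
   --conf spark.executor.memory=4g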

Upvotes: 1
