
Reputation: 574

Frame upload/creation on H2O external backend hangs from python/pyspark

I'm running into an issue where h2o.H2OFrame([1,2,3]) completes normally against an internal backend, but hangs against an external backend: the frame does get created inside H2O, yet the call never returns and the process just sits there.

It appears that the POST to /3/ParseSetup never returns (this is where urllib3 gets stuck). More specifically, the H2O logs for the external backend connection show the following (dates and IPs shortened):

* 10.*.*.15:56565 8120 #7003-141 INFO: Reading byte InputStream into Frame:
* 10.*.*.15:56565 8120 #7003-141 INFO: frameKey: upload_8a440dcf457c1e5deacf76a7ac1a4955
* 10.*.*.15:56565 8120 #7003-141 DEBUG: write-lock upload_8a440dcf457c1e5deacf76a7ac1a4955 by job null
* 10.*.*.15:56565 8120 #7003-141 INFO: totalChunks: 1
* 10.*.*.15:56565 8120 #7003-141 INFO: totalBytes:  21
* 10.*.*.15:56565 8120 #7003-141 DEBUG: unlock upload_8a440dcf457c1e5deacf76a7ac1a4955 by job null
* 10.*.*.15:56565 8120 #7003-141 INFO: Success.
* 10.*.*.15:56565 8120 #7003-135 INFO: POST /3/ParseSetup, parms: {source_frames=["upload_8a440dcf457c1e5deacf76a7ac1a4955"], check_header=1, separator=44}

By comparison, the internal backend completes that call and the log files contain:

** 10.*.*.15:54444 2421 #0581-148 INFO: totalBytes:  21
** 10.*.*.15:54444 2421 #0581-148 INFO: Success.
** 10.*.*.15:54444 2421 #0581-149 INFO: POST /3/ParseSetup, parms: {source_frames=["upload_b985730020211f576ef75143ce0e43f2"], check_header=1, separator=44}
** 10.*.*.15:54444 2421 #0581-150 INFO: POST /3/Parse, parms: {number_columns=1, source_frames=["upload_b985730020211f576ef75143ce0e43f2"], column_types=["Numeric"], single_quotes=False, parse_type=CSV, destination_frame=Key_Frame__upload_b985730020211f576ef75143ce0e43f2.hex, column_names=["C1"], delete_on_done=True, check_header=1, separator=44, blocking=False, chunk_size=4194304}
...

There is a difference in the by job null write-lock, but since it is released I don't think that is the critical issue. I've also curled the /3/ParseSetup endpoint directly, without success on either backend, and I'm now reading through the source code to figure out why.
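
For reference, this is roughly what I'm replaying from Python instead of curl (a minimal sketch, assuming the endpoint accepts the same form-encoded parms shown in the log; the host/port and the upload key below are placeholders for my setup):

    import requests

    h2o_url = "http://10.10.10.10:56565"                   # external H2O node (placeholder)
    frame_key = "upload_8a440dcf457c1e5deacf76a7ac1a4955"  # key reported in the H2O log

    # Same parms as the POST /3/ParseSetup log line above, but with a timeout
    # so the request fails fast instead of blocking the way the client does.
    resp = requests.post(h2o_url + "/3/ParseSetup",
                         data={"source_frames": '["' + frame_key + '"]',
                               "check_header": 1,
                               "separator": 44},
                         timeout=30)
    print(resp.status_code)
    print(resp.text)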

Despite the hanging process, I can see the uploaded frame with h2o.ls(), and I can retrieve it on the external backend with h2o.get_frame(frame_id="myframe_id").
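
In other words, something like this works even while the H2OFrame() call is still stuck (a small sketch; the frame id is a placeholder for whatever h2o.ls() reports):

    import h2o

    # The upload is visible and usable even though H2OFrame() never returned.
    print(h2o.ls())                            # lists the upload_... key
    f = h2o.get_frame(frame_id="myframe_id")   # placeholder frame id
    print(f.dim)                               # rows/columns of the retrieved frame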

I've tried/confirmed the following things:

Launching the external H2O cluster on YARN with:

    hadoop jar h2odriver-sw2.3.0-cdh5.14-extended.jar -Dmapreduce.job.queuename=root.users.myuser -jobname extback -baseport 56565 -nodes 10 -mapperXmx 10g -network 10.*.*.0/24

Converting a Spark DataFrame via hc.as_h2o_frame:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType

    sdf = session.createDataFrame(
        [('a', 1, 1.0), ('b', 2, 2.0)],
        schema=StructType([StructField("string", StringType()),
                           StructField("int", IntegerType()),
                           StructField("float", FloatType())]))
    hc.as_h2o_frame(sdf)

From a YARN point of view, I attempted client and cluster mode submissions of the simple test app:

spark2-submit --master yarn --deploy-mode cluster --queue root.users.myuser --conf 'spark.ext.h2o.client.port.base=65656' extreboot.py

and also without --master yarn and --deploy-mode cluster, i.e. the default client mode.

Lastly, the extreboot.py code is:

    from pyspark.conf import SparkConf
    from pyspark.sql import SparkSession
    from pysparkling import *
    import h2o

    conf = SparkConf().setAll([
        ('spark.ext.h2o.client.verbose', True),
        ('spark.ext.h2o.client.log.level', 'DEBUG'),
        ('spark.ext.h2o.node.log.level', 'DEBUG'),
        ('spark.ext.h2o.client.port.base', '56565'),
        ('spark.driver.memory', '8g'),
        ('spark.ext.h2o.backend.cluster.mode', 'external')])

    session = SparkSession.builder.config(conf=conf).getOrCreate()

    ip_addr = '10.10.10.10'
    port = 56565

    # point the H2OContext at the manually started external cluster ("extback")
    conf = H2OConf(session).set_external_cluster_mode().use_manual_cluster_start().set_h2o_cluster(ip_addr, port).set_cloud_name("extback")
    hc = H2OContext.getOrCreate(session, conf)

    print(h2o.ls())
    h2o.H2OFrame([1,2,3])  # this is the call that hangs on the external backend
    print('DONE')
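
To better see which REST call the client blocks on, I'm also planning to turn on urllib3 debug logging at the top of extreboot.py (a minimal sketch using only the standard logging module):

    import logging

    # Surface every HTTP request the h2o client makes via urllib3, so the
    # POST to /3/ParseSetup that never returns shows up on the console.
    logging.basicConfig(level=logging.DEBUG)
    logging.getLogger("urllib3").setLevel(logging.DEBUG)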

Does anyone know why it may be hanging (in comparison to the internal backend), what I'm doing wrong, or which steps I can take to better debug this? Thanks!

Upvotes: 0

Views: 244

Answers (1)

Lauren

Reputation: 5778

I would recommend upgrading to the latest version of Sparkling Water (currently 2.3.26 and available here), since you are using 2.3.12 and there have been several fixes for hanging issues since then. Hopefully a quick upgrade fixes your issue.

Upvotes: 1
