Reputation: 4159
Based on the question and subsequent answer here: When starting an h2o
instance running on a hadoop cluster, (with say hadoop jar h2odriver.jar -nodes 4 -mapperXmx 6g -output hdfsOutputDir
) the callback IP address used to connect to the h2o instance is selected by the hadoop runtime. So in most of the cases the IP address and the port is select by the Hadoop run time to find best available and looks like
....
H2O node 172.18.4.63:54321 reports H2O cluster size 4
H2O node 172.18.4.67:54321 reports H2O cluster size 4
H2O cluster (4 nodes) is up
(Note: Use the -disown option to exit the driver after cluster formation)
Open H2O Flow in your web browser: http://172.18.4.67:54321
Connection url output line: Open H2O Flow in your web browser: http://172.18.4.67:54321
The recommended way of using h2o
is to start and stop individual instances each time you want to use it (sorry, can't currently find the supporting documentation). The problem here is that if you want your python code to start up and connect to a h2o
instance automatically, it is not going to know what IP to connect to until the h2o
instance is already up and running. Thus, a common way to start H2O cluster on Hadoop is to let the Hadoop decide the cluster, then to parse the output for the line
Open H2O Flow in your web browser: x.x.x.x:54321
to get/extract the IP address.
The problem here is that h2o
is a blocking process who's output prints as a stream of text lines as the instance starts up rather than in bulk, this made it hard for me to get the final output line needed using basic python Popen logic to capture output. Is there a way to capture the output as it is being generated to get the line with the connection IP?
Upvotes: 0
Views: 762
Reputation: 4159
The solution that I ended up using was to start the h2o
process in a seperate thread and pass output back to main thread through a queue that we then read from and use regex to search for the connection IP. See example below.
# startup hadoop h2o cluster
import shlex
import re
from Queue import Queue, Empty
from threading import Thread
def enqueue_output(out, queue):
"""
Function for communicating streaming text lines from seperate thread.
see https://stackoverflow.com/questions/375427/non-blocking-read-on-a-subprocess-pipe-in-python
"""
for line in iter(out.readline, b''):
queue.put(line)
out.close()
# series of commands to run in-order for for bringing up the h2o cluster on demand
startup_cmds = [
# remove any existing tmp log dir. for h2o processes
'rm -r /some/location/for/h2odriver.jar/output',
# start h2o on cluster
'/bin/hadoop jar {}h2odriver.jar -nodes 4 -mapperXmx 6g -output hdfsOutputDir'.format("/local/h2o/start/path")
]
# clear legacy temp. dir.
if os.path.isdir(/some/location/for/h2odriver.jar/output):
print subprocess.check_output(shlex.split(startup_cmds[0]))
# start h2o service in background thread
startup_p = subprocess.Popen(shlex.split(startup_cmds[1]),
shell=False,
stdout=subprocess.PIPE, stderr=subprocess.PIPE)
# setup message passing queue
q = Queue()
t = Thread(target=enqueue_output, args=(startup_p.stdout, q))
t.daemon = True # thread dies with the program
t.start()
# read line without blocking
h2o_url_out = ''
while True:
try: line = q.get_nowait() # or q.get(timeout=.1)
except Empty:
continue
else: # got line
print line
# check for first instance connection url output
if re.search("Open H2O Flow in your web browser", line) is not None:
h2o_url_out = line
break
if re.search('Error', line) is not None:
print 'Error generated: %s' % line
sys.exit()
# capture connection IP from h2o process output
print 'Connection url output line: %s' % h2o_url_out
h2o_cnxn_ip = re.search("(?<=Open H2O Flow in your web browser: http:\/\/)(.*?)(?=:)", h2o_url_out).group(1)
print 'H2O connection ip: %s' % h2o_cnxn_ip
Upvotes: 0