Hanjo Odendaal

Reputation: 1441

Multi Node H2O cluster in R not detecting other EC2 instances

I have been struggling to get a multi-node H2O cluster up and running on AWS EC2 instances. I have followed the advice from this thread, but the nodes still do not see each other. The EC2 instances all use the same pre-built AMI, so the same h2o.jar file is on all of them.

I have also tried the following troubleshooting advice:

Here are my steps:

1) Start the AWS EC2 instances in the same availability zone and get their private IPs and the network CIDR (172.31.0.0/20). Put the IP addresses into flatfile.txt:

172.31.8.210:54321
172.31.9.207:54321
172.31.13.136:54321
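
As a rough sketch, the flatfile above could also be generated from R instead of by hand. The helper name `write_flatfile` and the vector `private_ips` are hypothetical and used only for illustration; the example writes to a temporary directory so it does not overwrite an existing flatfile.txt.

```r
# Hypothetical helper: write one "ip:port" entry per line,
# which is the layout of the flatfile shown above.
write_flatfile <- function(private_ips, port = 54321, path = "flatfile.txt") {
  entries <- paste0(private_ips, ":", port)
  writeLines(entries, path)
  invisible(entries)
}

# Private IPs taken from the question
private_ips <- c("172.31.8.210", "172.31.9.207", "172.31.13.136")

# Written to a temporary location for this illustration
flatfile_path <- file.path(tempdir(), "flatfile.txt")
entries <- write_flatfile(private_ips, path = flatfile_path)
```

Every node should receive an identical copy of the resulting file.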

2) Copy flatfile.txt to every server that I want to join as a node, and start H2O on each of them

# cluster_run
library(h2oEnsemble)
library(ssh)

ips <- gsub("(.*):.*", "\\1", readLines("flatfile.txt"))

start_cluster <- function(ip){
  # Copy flatfile across
  session <- ssh_connect(paste0("ubuntu@", ip), keyfile = "mykey.pem")
  scp_upload(session, "flatfile.txt")

  # Ensure no h2o instance is already running
  out <- ssh_exec_wait(session, "sudo pkill java")

  # Start H2O cluster
  cmd <- gsub("\\s+", " ", paste0("ssh -i mykey.pem -o 'StrictHostKeyChecking no' ubuntu@", ip, 
         " 'java -Xmx20g 
         -jar /home/rstudio/R/x86_64-pc-linux-gnu-library/3.5/h2o/java/h2o.jar
         -name mycluster
         -network 172.31.0.0/20
         -flatfile flatfile.txt 
         -port 54321 &'"))
  system(cmd, wait = FALSE)

  # Close the SSH session that was opened for the upload
  ssh_disconnect(session)
}
start_cluster(ips[3])
start_cluster(ips[2])
start_cluster(ips[1])

3) Once this has been done, I now want to connect R to my new Multi Node cluster

h2o.init(startH2O = FALSE)
h2o.shutdown(prompt = FALSE)

This is where I see that the nodes aren't being picked up.

I have also seen that when I start the H2O cluster on the different nodes, it isn't picking up the other machines within the network.

Upvotes: 1

Views: 84

Answers (1)

TomKraljevic

Reputation: 3671

You also need to open port 54321+1 (that is, 54322) in the security group.

The internal node-to-node communication goes through 54322.
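
One way to confirm that both ports are reachable is to attempt a TCP connection from one node to another. The following is a minimal sketch using only base R; the function names `port_open` and `check_h2o_ports` are hypothetical, and it assumes H2O is already running on the target node. A port blocked by a security group will usually time out rather than be refused, hence the short timeout.

```r
# Hypothetical helper: TRUE if a TCP connection to host:port succeeds.
port_open <- function(host, port, timeout = 2) {
  con <- tryCatch(
    suppressWarnings(
      socketConnection(host = host, port = port, open = "r+",
                       blocking = TRUE, timeout = timeout)
    ),
    error = function(e) NULL
  )
  if (is.null(con)) return(FALSE)
  close(con)
  TRUE
}

# Check the API port and the internal port (API port + 1) on one node.
check_h2o_ports <- function(host, api_port = 54321) {
  ports <- c(api = api_port, internal = api_port + 1)
  vapply(ports, function(p) port_open(host, p), logical(1))
}

# Example call, run from another node in the cluster:
# check_h2o_ports("172.31.9.207")
```

If `api` comes back TRUE but `internal` comes back FALSE, that would be consistent with only 54321 being open in the security group.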

(I would also specify /16 for -network because it’s easier for other people to understand. For example, even if you are sure /20 is technically correct for your network setup, I can’t easily be sure. :-)

Depending on the actual network setup, you probably don't need the -network flag at all. Your instances probably only have one interface.
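
If the -network flag is kept, it is easy to check that every address in the flatfile really falls inside the chosen range. This is a small base R sketch for IPv4 only; `ip_to_number` and `ip_in_cidr` are hypothetical helper names, not part of H2O.

```r
# Convert a dotted IPv4 address to a single number.
ip_to_number <- function(ip) {
  octets <- as.numeric(strsplit(ip, ".", fixed = TRUE)[[1]])
  sum(octets * 256^(3:0))
}

# TRUE if `ip` lies inside the CIDR block, e.g. "172.31.0.0/20".
ip_in_cidr <- function(ip, cidr) {
  parts <- strsplit(cidr, "/", fixed = TRUE)[[1]]
  block_size <- 2^(32 - as.integer(parts[2]))
  floor(ip_to_number(ip) / block_size) ==
    floor(ip_to_number(parts[1]) / block_size)
}

# Private IPs from the question, checked against both ranges
ips <- c("172.31.8.210", "172.31.9.207", "172.31.13.136")
in_20 <- vapply(ips, ip_in_cidr, logical(1), cidr = "172.31.0.0/20")
in_16 <- vapply(ips, ip_in_cidr, logical(1), cidr = "172.31.0.0/16")
```

A /20 starting at 172.31.0.0 only covers 172.31.0.0 to 172.31.15.255, so an instance at, say, 172.31.16.1 would fall outside it while still being inside the /16.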

Upvotes: 2
