Jim Crozier

Reputation: 1418

Connect sparklyr to remote spark connection

I would like to connect my local desktop RStudio session to a remote Spark session via sparklyr. When you add a new connection in the Spark tab of the RStudio UI and choose "Cluster", it says that you have to be running on the cluster or have a high-bandwidth connection to the cluster.

Can anyone shed light on how to create that kind of connection? I am not sure how to create a reproducible example of this, but in general what I would like to do is run:

library(sparklyr)
sc <- spark_connect(master = "spark://ip-[MY_PRIVATE_IP]:7077", spark_home = "/home/ubuntu/spark-2.0.0", version="2.0.0")

from a remote server. I understand that there will be latency, especially when passing data between the two machines. I also understand that it would be better to run RStudio Server on the actual cluster, but that is not always possible, and I am looking for a sparklyr option for interacting between my server and my desktop RStudio session. Thanks.

Upvotes: 11

Views: 7585

Answers (3)

Akshay Kadidal

Reputation: 535

I finally managed to connect my local R session to a cloud-hosted Spark cluster (Azure HDInsight in my case) using Livy.

sparklyr's spark_connect has an option to connect through Livy (method = "livy"):

sc <- spark_connect(master = "https://<clustername>.azurehdinsight.net/livy/",
                     method = "livy", config = livy_config(
                       username = "<admin>",
                       password = rstudioapi::askForPassword("Livy password:")))
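
Once the connection is up, one quick way to confirm that the session works is to push a small data frame through it. This is only a sketch: it assumes `sc` is the Livy connection created above and that dplyr is installed locally. The table name `mtcars_check` and the use of the built-in `mtcars` dataset are my own choices for illustration, not part of the original setup.

```r
library(sparklyr)
library(dplyr)

# Assumes `sc` is the Livy connection created above.
# Copy a small built-in data frame to the cluster.
# "mtcars_check" is a hypothetical table name chosen for this example.
mtcars_tbl <- copy_to(sc, mtcars, name = "mtcars_check", overwrite = TRUE)

# Run the aggregation on the cluster and bring only the result back.
mtcars_tbl %>%
  count(cyl) %>%
  collect()

# Close the session when finished so it does not linger on the cluster.
spark_disconnect(sc)
```

Copying data over Livy tends to be slow, so it is probably best to keep local-to-cluster transfers small and do the heavy work on data that already lives on the cluster.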

Upvotes: 2

Romain

Reputation: 21878

With a more recent version of sparklyr (0.9.2, for example), it is possible to connect to a remote Spark cluster.

Here is an example that connects to a Spark standalone cluster running version 2.3.1. See Master URLs in the Spark documentation for other master URL schemes.

#install.packages("sparklyr")
library(sparklyr)

# The same Spark version as the cluster must be installed locally
# (on the driver machine, i.e. where RStudio is running)
spark_v <- "2.3.1"
cat("Installing Spark in the directory:", spark_install_dir(), "\n")
spark_install(version = spark_v)

sc <- spark_connect(spark_home = spark_install_find(version=spark_v)$sparkVersionDir, 
                    master = "spark://ip-[MY_PRIVATE_IP]:7077")

sc$master
# "spark://ip-[MY_PRIVATE_IP]:7077"
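
To check that the session can actually run work on the remote cluster, something like the following should do. It is a minimal sketch that assumes `sc` is the connection created above; the row count of 10 is arbitrary.

```r
library(sparklyr)

# Assumes `sc` is the connection created above.
# Report the Spark version associated with the connection.
spark_version(sc)

# Build a 10-row Spark DataFrame on the cluster and count its rows there.
sdf_nrow(sdf_len(sc, 10))

# Release the cluster resources when done.
spark_disconnect(sc)
```

If the row count hangs or fails while `sc$master` looks right, a likely cause is that the cluster nodes cannot reach the driver machine over the network, so that is worth checking first.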

I've written a post on this topic.

Upvotes: 9

Javier Luraschi

Reputation: 912

As of sparklyr version 0.4, connecting from RStudio Desktop to a remote Spark cluster is not supported. Instead, as you mention, the recommended approach is to install RStudio Server within the Spark cluster.

That said, the livy branch of sparklyr is exploring an integration with Livy that would enable RStudio Desktop to connect to a remote Spark cluster through Livy.

Upvotes: 8
