Vijay Bhoomireddy

Reputation: 576

RStudio standalone connecting to HDFS

I have a standalone installation of R and RStudio on my laptop (Windows / Mac) and a Hadoop cluster installed remotely (Linux). I would like to connect to HDFS from RStudio to read data, do the processing, and then, if required, push the results back to HDFS.

I am not sure whether this is possible or whether it requires the server version of RStudio. Can anyone suggest the best alternative?

Thanks

Upvotes: 2

Views: 626

Answers (1)

FaceTheDream

Reputation: 51

Is it a secure cluster? If it isn't, the rwebhdfs package solves this problem. With it, you can connect to a remote cluster as follows:

library(rwebhdfs)
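# 50070 is the NameNode's default WebHDFS port on Hadoop 2.x (9870 on Hadoop 3)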
hdfs <- webhdfs("<hdfs-webfs-node>", 50070, "<user>")
f <- read_file(hdfs, "/<path>/<to>/<file>")

The package relies on RCurl, which has limitations (libcurl v1.0.0o on Windows) when used with a secure cluster. To access a secure cluster I would use the httr package and query the cluster directly through the WebHDFS REST API:

# WebHDFS url
hdfsUri <- "http://namenodedns:port/webhdfs/v1"
# Uri of the file you want to read
fileUri <- "/user/username/myfile.csv"
# Optional parameters, in the format &name1=value1&name2=value2
optionalParameters <- ""

# OPEN => read a file
readParameter <- "?op=OPEN"

# Concatenate all the parameters into one uri
uri <- paste0(hdfsUri, fileUri, readParameter, optionalParameters)

# Read the file with any function that can read from a URL connection
data <- read.csv(uri)

Code taken directly from link
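
To make the secure path concrete, here is a minimal sketch (not from the original answer) of the httr route against a Kerberized cluster, including the write-back to HDFS that the question asks about. The host, port, and file paths are placeholders, and it assumes you already hold a Kerberos ticket (via kinit) and that WebHDFS is enabled on your cluster; adjust to your environment.

library(httr)

# Same placeholder endpoint as above; replace with your NameNode and port
hdfsUri <- "http://namenodedns:port/webhdfs/v1"

# Read: OPEN with SPNEGO/Kerberos auth; httr follows the NameNode's
# redirect to a datanode automatically
response <- GET(paste0(hdfsUri, "/user/username/myfile.csv", "?op=OPEN"),
                authenticate("", "", type = "gssnegotiate"))
data <- read.csv(text = content(response, as = "text"))

# Write back: CREATE is a two-step call. The NameNode answers the first
# PUT (no body) with a 307 redirect; the file body is then sent to the
# datanode named in the Location header, so redirects must not be
# followed automatically here
tmp <- tempfile(fileext = ".csv")
write.csv(data, tmp, row.names = FALSE)

redirect <- PUT(paste0(hdfsUri, "/user/username/results.csv",
                       "?op=CREATE&overwrite=true"),
                config(followlocation = FALSE),
                authenticate("", "", type = "gssnegotiate"))
PUT(headers(redirect)$location, body = upload_file(tmp),
    authenticate("", "", type = "gssnegotiate"))

On Kerberized clusters the redirect URL usually carries a delegation token for the datanode step, so the second authenticate() call may be redundant, but it is harmless to include.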

There is no reason to get RStudio Server. I hope this points you in the right direction.

Upvotes: 1
