SparkR Read/Write with HDFS

I am trying to figure out how to read and write arbitrary files to/from HDFS in SparkR.

The setup is:

args <- commandArgs(trailingOnly = T)
MASTER <- args[1]
SPARK_HOME <- args[2]
INPATH <- 'hdfs/path/to/read/or/load/from'
OUTPATH <- 'hdfs/path/to/write/save/to'

Sys.setenv(SPARK_HOME = SPARK_HOME) 
.libPaths(c(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib'), .libPaths()))
library(SparkR)

sparkR.session(master = MASTER)

# How to load RData?
load(paste(INPATH, 'rObjects.RData', sep = ''))

# How to read data?
dat <- read.csv(paste(INPATH, 'datafile.csv', sep = ''))

# Perform operations.....

# How to write?
write.csv(dat, paste(OUTPATH, 'outdata.csv', sep = ''))

I know these operations can be done with a shell script, or with similar system calls from within R, e.g.:

system('hadoop fs -copyToLocal ...')

but I am intentionally trying to avoid these solutions.
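For completeness, the kind of workaround I want to avoid looks roughly like this (a sketch only; the local /tmp paths and file names are placeholders):

# Stage the HDFS file onto the local filesystem, use base R, then copy results back
system(paste('hadoop fs -copyToLocal', paste(INPATH, 'rObjects.RData', sep = ''), '/tmp/rObjects.RData'))
load('/tmp/rObjects.RData')  # plain local load() once the file is staged
# ... operations ...
save(dat, file = '/tmp/outdata.RData')
system(paste('hadoop fs -copyFromLocal', '/tmp/outdata.RData', paste(OUTPATH, 'outdata.RData', sep = '')))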

Spark v. 2.0.1

R v. 3.3.2

Edit: A comment below notes this is a possible duplicate -- that question deals more specifically with reading CSVs (part of my question), but it is still unclear how to load .RData files or how to read/write files more generally.

Upvotes: 3

Views: 1506

Answers (1)

Manikanta Mahesh Byra

Reputation: 159

To read and write data frames in SparkR, use these:

sdf <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "NA")
write.df(sdf, path = "people.csv", source = "csv", mode = "overwrite")
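Applied to the paths in the question, a minimal sketch might look like this (assuming INPATH and OUTPATH are fully qualified hdfs:// URIs; the file names are placeholders). collect() pulls the distributed data into a local data.frame, and as.DataFrame() converts it back before writing:

# Read a CSV stored on HDFS into a SparkDataFrame
sdf <- read.df(paste(INPATH, 'datafile.csv', sep = ''), source = "csv",
               header = "true", inferSchema = "true")

# Optionally bring it into a local R data.frame for base-R operations
dat <- collect(sdf)

# Convert back to a SparkDataFrame and write it out to HDFS
write.df(as.DataFrame(dat), path = paste(OUTPATH, 'outdata.csv', sep = ''),
         source = "csv", mode = "overwrite")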

To work with RDDs directly, use these (note the ::: -- this is SparkR's internal, unexported API):

rdd <- SparkR:::textFile(sc = sc, path = "path", minPartitions = 4)
SparkR:::saveAsTextFile(rdd, "path")
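One caveat: sparkR.session() does not return a SparkContext, which these internal functions need as sc. The sketch below assumes SparkR's unexported getSparkContext() helper is available; it is internal API and may differ between SparkR versions:

# Assumption: SparkR exposes the active context via an internal helper
sc <- SparkR:::getSparkContext()
# 'somefile.txt' is a placeholder name under the question's INPATH
rdd <- SparkR:::textFile(sc = sc, path = paste(INPATH, 'somefile.txt', sep = ''), minPartitions = 4)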

Databricks also has a good package (spark-csv) for working with CSV files in SparkR: link
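A rough sketch of how that package can be used, under the assumption that the Maven coordinates below match your Scala/Spark build (it matters mainly on Spark builds without the built-in csv source; on 2.0+ the plain "csv" source above is enough):

# Pull the package in when starting the session, then name it as the source
sparkR.session(master = MASTER,
               sparkPackages = "com.databricks:spark-csv_2.11:1.5.0")
sdf <- read.df(csvPath, source = "com.databricks.spark.csv",
               header = "true", inferSchema = "true")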

Upvotes: 3
