Reputation: 2917
When I run the following code:
rdd <- lapply(parallelize(sc, 1:10), function(x) list(a=x, b=as.character(x)))
df <- createDataFrame(sqlContext, rdd)
I get an error message saying
Error in lapply(parallelize(sc, 1:10), function(x) list(a = x, b = as.character(x))) :
could not find function "parallelize"
I can, however, create data frames:
library(magrittr)
library(rJava)
Sys.setenv(SPARK_HOME = "C:\\Apache\\spark-1.6.0-bin-hadoop2.6")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master="local")
sqlContext <- sparkRSQL.init(sc)
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
df <- createDataFrame(sqlContext, localDF)
Any reason why parallelize does not work?
Upvotes: 2
Views: 2134
Reputation: 1248
You must specify the package, SparkR, because these functions are not exported. Please try this code:
rdd <- SparkR:::lapply(SparkR:::parallelize(sc, 1:10),
                       function(x) list(a = x, b = as.character(x)))
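If that runs, converting the result to a DataFrame should work as in your original snippet. A minimal continuation, relying on non-exported internals that may change between releases:
# createDataFrame accepts the internal RDD in SparkR 1.4-1.6
df <- createDataFrame(sqlContext, rdd)
head(df)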
Upvotes: 2
Reputation: 330063
Spark < 2.0:
It doesn't work because, since the first official release of SparkR (Spark 1.4.0), the RDD API is no longer publicly available. You can check SPARK-7230 (Make RDD API private in SparkR for Spark 1.4) for details.
While some of these methods can be accessed using the internal API (via :::), you shouldn't depend on it. With some exceptions, they are no longer used by the current code base or actively maintained. What is even more important, there are multiple known issues which are rather unlikely to be fixed.
If you're interested in lower-level R access, you should follow SPARK-7264 (SparkR API for parallel functions) and SPARK-6817 (DataFrame UDFs in R).
Spark 2.0+:
Spark 2.0 introduced a set of methods (dapply, gapply, lapply) designed to work with R UDFs on top of a Spark DataFrame.
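A minimal sketch of these methods, assuming Spark 2.0+, where a session is started with sparkR.session() and the list-oriented method ships under the name spark.lapply:
library(SparkR)
sparkR.session(master = "local")

# spark.lapply distributes a list and applies an R function to each element,
# much like the old parallelize + lapply combination
res <- spark.lapply(1:10, function(x) list(a = x, b = as.character(x)))

# dapply applies an R function to each partition of a SparkDataFrame;
# the schema of the result must be declared up front
df <- createDataFrame(data.frame(a = 1:10))
schema <- structType(structField("a", "integer"),
                     structField("b", "string"))
out <- dapply(df, function(part) cbind(part, b = as.character(part$a)), schema)
head(collect(out))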
Upvotes: 5