Reputation: 2917
When I run the following code:
rdd <- lapply(parallelize(sc, 1:10), function(x) list(a=x, b=as.character(x)))
df <- createDataFrame(sqlContext, rdd)
I get an error message saying
Error in lapply(parallelize(sc, 1:10), function(x) list(a = x, b = as.character(x))) :
could not find function "parallelize"
I can, however, create data frames:
library(magrittr)
library(rJava)
Sys.setenv(SPARK_HOME = "C:\\Apache\\spark-1.6.0-bin-hadoop2.6")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init(master="local")
sqlContext <- sparkRSQL.init(sc)
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
df <- createDataFrame(sqlContext, localDF)
Any reason why parallelize does not work?
Upvotes: 2
Views: 2134
Reputation: 1248
You must specify the package, SparkR, because these functions are not exported. Please try this code:
rdd <- SparkR:::lapply(SparkR:::parallelize(sc, 1:10),
                       function(x) list(a = x, b = as.character(x)))
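If that runs, converting the result to a DataFrame should work as in your original snippet. A minimal continuation, relying on non-exported internals that may change between releases:
# createDataFrame accepts the internal RDD in SparkR 1.4-1.6
df <- createDataFrame(sqlContext, rdd)
head(df)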
Upvotes: 2
Reputation: 330063
Spark < 2.0:
It doesn't work because, since the first official release of SparkR (Spark 1.4.0), the RDD API is no longer publicly available. You can check SPARK-7230 (Make RDD API private in SparkR for Spark 1.4) for details.
While some of these methods can be accessed using the internal API (via :::), you shouldn't depend on it. With some exceptions, they are no longer used by the current code base or actively maintained. What is even more important, there are multiple known issues which are rather unlikely to be fixed.
If you're interested in lower-level R access, you should follow SPARK-7264 (SparkR API for parallel functions) and SPARK-6817 (DataFrame UDFs in R).
Spark 2.0+:
Spark 2.0 introduced a set of methods (dapply, gapply, lapply) designed to work with R UDFs on top of a Spark DataFrame.
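A minimal sketch of these methods, assuming Spark 2.0+, where a session is started with sparkR.session() and the list-oriented method ships under the name spark.lapply:
library(SparkR)
sparkR.session(master = "local")

# spark.lapply distributes a list and applies an R function to each element,
# much like the old parallelize + lapply combination
res <- spark.lapply(1:10, function(x) list(a = x, b = as.character(x)))

# dapply applies an R function to each partition of a SparkDataFrame;
# the schema of the result must be declared up front
df <- createDataFrame(data.frame(a = 1:10))
schema <- structType(structField("a", "integer"),
                     structField("b", "string"))
out <- dapply(df, function(part) cbind(part, b = as.character(part$a)), schema)
head(collect(out))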
Upvotes: 5