Reputation: 13
Can anyone tell me whether I can import the spark-csv package from SparkR using RStudio in a Windows 7 environment? My local machine has R 3.2.2, spark-1.6.1-bin-hadoop2.6 and Java installed, but not Maven, Scala, etc. Am I missing anything needed to call spark-csv? Should I install this package (a .jar file) and put it in some folder?
Here is my script:
library(rJava)
Sys.setenv(SPARK_HOME = 'C:/Users/***/spark-1.6.1-bin-hadoop2.6')
.libPaths(c(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib'), .libPaths()))
library(SparkR)
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.11:1.4.0" "sparkr-shell"')
sc <- sparkR.init(master = "local[*]", sparkEnvir = list(spark.driver.memory="2g"))
sqlContext <- sparkRSQL.init(sc)
I was able to load the SparkR library and initialize an sc; here is the message:
Launching java with spark-submit command C:/Users/***/spark-1.6.1-bin-hadoop2.6/bin/spark-submit.cmd --driver-memory "2g" "--packages" "com.databricks:spark-csv_2.11:1.4.0" "sparkr-shell" C:\Users\hwu\AppData\Local\Temp\2\Rtmp46MVve\backend_port13b423eed9c
Then, when I try to load a local CSV file, it fails. I have already put the CSV file in R's current working directory.
flights <- read.df(sqlContext, "nycflights13.csv", "com.databricks.spark.csv", header="true")
I got this error message:
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(Unknown Source)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.r...(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:406)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:7
Thank you for any advice.
Upvotes: 1
Views: 527
Reputation: 60318
Pre-built Spark distributions, like the one you are using, are still built with Scala 2.10, not 2.11. Accordingly, you need a spark-csv build for Scala 2.10, not for Scala 2.11 (which is what you use in your code). Change com.databricks:spark-csv_2.11:1.4.0 to com.databricks:spark-csv_2.10:1.4.0, and you should be fine (see also my answer in a relevant SO question).
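For example, with the question's settings otherwise unchanged, the environment variable would be set like this (a sketch; only the Scala suffix of the artifact changes):
# only _2.11 -> _2.10 changes; everything else stays as in the question
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.4.0" "sparkr-shell"')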
I have never tested Spark on Windows, but I recently put together a short demo of using SparkR in RStudio in a blog post, which you might find useful.
Upvotes: 0
Reputation: 329
instead of this:
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.11:1.4.0" "sparkr-shell"')
try this:
Sys.setenv(SPARKR_SUBMIT_ARGS = "--packages com.databricks:spark-csv_2.11:1.4.0 sparkr-shell")
or perhaps this:
sc <- sparkR.init(master = "local[*]", appName = "yourapp", sparkPackages = "com.databricks:spark-csv_2.11:1.4.0")
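With sc initialized that way, the CSV can then be read as in the question (a sketch, assuming nycflights13.csv sits in the current working directory as you describe):
# create the SQL context and read the CSV via the spark-csv data source
sqlContext <- sparkRSQL.init(sc)
flights <- read.df(sqlContext, "nycflights13.csv", source = "com.databricks.spark.csv", header = "true")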
Upvotes: 0