Reputation: 2158
How do you load a CSV file into SparkR in RStudio? Below are the steps I performed to run SparkR in RStudio. I used read.df to read the .csv file, but I'm not sure how else to write this, or whether this step counts as creating RDDs.
#Set sys environment variables
Sys.setenv(SPARK_HOME = "C:/Users/Desktop/spark/spark-1.4.1-bin-hadoop2.6")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
#Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"')
#Load libraries
library(SparkR)
library(magrittr)
sc <- sparkR.init(master="local")
sc <- sparkR.init()
sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.11:1.0.3")
sqlContext <- sparkRSQL.init(sc)
data <- read.df(sqlContext, "C:/Users/Desktop/DataSets/hello_world.csv", "com.databricks.spark.csv", header="true")
I am getting this error:
Error in writeJobj(con, object) : invalid jobj 1
Upvotes: 5
Views: 1995
Reputation: 330413
Spark 2.0.0+:
You can use the csv data source:
loadDF(sqlContext, path="some_path", source="csv", header="true")
without loading spark-csv.
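For context, a minimal sketch of what a full Spark 2.0.0+ session might look like. It assumes SparkR is already on the library path and uses sparkR.session(), which replaces sparkR.init() and sparkRSQL.init() in 2.0; "some_path" is the placeholder from above, and the local master and inferSchema option are illustrative choices, not part of the original answer.

```r
library(SparkR)

# Start (or reuse) a Spark session; assumed to run locally for illustration
sparkR.session(master = "local[*]")

# Read the CSV with the built-in csv source; no spark-csv package is needed
# "some_path" is a placeholder, inferSchema is an optional csv reader option
df <- read.df("some_path", source = "csv", header = "true", inferSchema = "true")

head(df)
```

In 2.0.0+ the sqlContext argument is no longer needed, so it is left out here.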
Original answer:
As far as I can tell, you're using the wrong version of spark-csv. Pre-built versions of Spark 1.x use Scala 2.10, but you're using the spark-csv build for Scala 2.11. Try this instead:
sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.10:1.2.0")
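Put together with the setup from the question, the corrected sequence might look like the sketch below. The paths are the ones from the question. Calling sparkR.init only once, with both master and sparkPackages, is my assumption about what was intended, since each call in the question overwrites the previous sc.

```r
# Point R at the Spark installation (path taken from the question)
Sys.setenv(SPARK_HOME = "C:/Users/Desktop/spark/spark-1.4.1-bin-hadoop2.6")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

library(SparkR)

# Initialize once, requesting the Scala 2.10 build of spark-csv
sc <- sparkR.init(master = "local",
                  sparkPackages = "com.databricks:spark-csv_2.10:1.2.0")
sqlContext <- sparkRSQL.init(sc)

# Read the CSV through the spark-csv data source
data <- read.df(sqlContext, "C:/Users/Desktop/DataSets/hello_world.csv",
                source = "com.databricks.spark.csv", header = "true")
head(data)
```

If a Spark context is already running in the R session, restart R first (or call sparkR.stop()) so that the package setting takes effect.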
Upvotes: 3
Reputation: 2158
I appreciate everyone's input and solutions! I figured out another way to load a .csv file into SparkR in RStudio. Here it is:
#set sc
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)
#load .csv
patients <- read.csv("C:/...") #Insert your .csv file path
df <- createDataFrame(sqlContext, patients)
df
head(df)
str(df)
Upvotes: 1
Reputation: 5562
I solved this issue by providing commons-csv-1.2.jar together with the spark-csv package.
Apparently, spark-csv uses commons-csv but is not packaged with it.
Using the following SPARKR_SUBMIT_ARGS solved the issue (I use --jars rather than --packages).
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--jars" "/usr/lib/spark-1.5.1-bin-hadoop2.6/lib/spark-csv_2.11-1.2.0.jar,/usr/lib/spark-1.5.1-bin-hadoop2.6/lib/commons-csv-1.2.jar" "sparkr-shell"')
In fact, the rather obscure error
Error in writeJobj(con, object) : invalid jobj 1
is clearer when you use the R shell directly instead of RStudio, where the message clearly states:
java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat
The required commons-csv jar can be found here: https://commons.apache.org/proper/commons-csv/download_csv.cgi
Upvotes: 1