I've recently started using SparkR and would like to run some correlation analysis with it. I'm able to load my data as a SparkR DataFrame, but it doesn't let me run a simple cor() analysis on it (I get the S4 error below):
/usr/local/src/spark/spark-1.5.1/bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3
library(SparkR)
setwd('/DATA/')
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
sqlContext <- sparkRSQL.init(sc)
df <- read.df(sqlContext, "/DATA/GSE45291/GSE45291.csv", source = "com.databricks.spark.csv", inferSchema = "true")
results <- cor(data.matrix(df), method = "pearson")
## Error in as.vector(data) : no method for coercing this S4 class to a vector
Is there no built-in correlation function for SparkR? How can I fix the S4 object so that I can use base R functions on it? Any suggestions are appreciated. Thanks -Rich
Spark < 1.6
How can I fix the S4 object so that I can use base R functions on it?
You simply cannot. Spark DataFrames are not a drop-in replacement for a standard R data.frame. If you want, you can collect to a local R data.frame, but most of the time that won't be a feasible solution.
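If the data does fit in driver memory, a minimal sketch of the collect route (assuming all columns of the question's df are numeric):

local_df <- collect(df)                        # pull the whole Spark DataFrame into the driver
cor(data.matrix(local_df), method = "pearson") # base R correlation matrix on the local copy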
You can use a Hive UDF to compute the correlation between individual columns. First you'll need a Hive context:
sqlContext <- sparkRHive.init(sc)
and some dummy data:
ldf <- iris[, -5]
colnames(ldf) <- tolower(gsub("\\.", "_", colnames(ldf)))
sdf <- createDataFrame(sqlContext, ldf)
Next you have to register a temporary table:
registerTempTable(sdf, "sdf")
Now you can use a SQL query like this:
q <- sql(sqlContext, "SELECT corr(sepal_length, sepal_width) FROM sdf")
head(q)
## _c0
## 1 -0.1175698
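The same pattern extends to several column pairs in one pass; a sketch using the dummy columns registered above (output omitted):

# compute multiple pairwise correlations in a single query over the temp table
q2 <- sql(sqlContext, "SELECT corr(sepal_length, petal_length) AS sl_pl,
                              corr(sepal_width, petal_width) AS sw_pw
                       FROM sdf")
head(q2)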
Spark >= 1.6
You can use the corr function on a DataFrame directly.
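For example, with the dummy data from above, a sketch assuming the SparkR 1.6 signature corr(x, col1, col2, method = "pearson"):

# correlation of two columns computed directly on the DataFrame, no SQL needed
corr(sdf, "sepal_length", "sepal_width")
## [1] -0.1175698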