greener

Reputation: 21

Running correlations in SparkR: no method for coercing this S4 class to a vector

I've recently started using SparkR and would like to run some correlation analysis with it. I'm able to load content as a SparkR DataFrame, but it doesn't allow running a simple cor() analysis on the data frame (I get the S4 error shown below):

/usr/local/src/spark/spark-1.5.1/bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3
library(SparkR)

setwd('/DATA/')

Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')

sqlContext <- sparkRSQL.init(sc)

df <- read.df(sqlContext, "/DATA/GSE45291/GSE45291.csv", source = "com.databricks.spark.csv", inferSchema = "true")

results <- cor(as.data.matrix(df), type="pearson")

data.matrix(df)
# Error in as.vector(data) : no method for coercing this S4 class to a vector

Is there no built-in correlation function in SparkR? How can I fix the S4 object so that it works in R, where I can use base functions? Any suggestions are appreciated. Thanks -Rich

Upvotes: 2

Views: 4285

Answers (1)

zero323

Reputation: 330413

Spark < 1.6

How can I fix the S4 object to work in R where I can perform base functions?

You simply cannot. Spark DataFrames are not a drop-in replacement for a standard R data.frame. If you want, you can collect a Spark DataFrame to a local R data.frame, but most of the time that won't be a feasible solution.
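If your data does happen to fit in driver memory, a minimal sketch of that approach (assuming the same df as in the question) looks like this:

# Only workable when the data fits on the driver
ldf <- collect(df)                                  # local R data.frame
results <- cor(data.matrix(ldf), method = "pearson")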

You can use Hive's corr UDAF, via SQL, to compute the correlation between individual columns. First you'll need a Hive context:

sqlContext <- sparkRHive.init(sc)

and some dummy data:

ldf <- iris[, -5]                                          # drop the Species factor
colnames(ldf) <- tolower(gsub("\\.", "_", colnames(ldf)))  # Hive-friendly column names
sdf <- createDataFrame(sqlContext, ldf)

Next you have to register a temporary table:

registerTempTable(sdf, "sdf")

Now you can compute the correlation with a SQL query like this:

q <- sql(sqlContext, "SELECT corr(sepal_length, sepal_width) FROM sdf")
head(q)
##          _c0
## 1 -0.1175698
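To get something closer to the full correlation matrix the question asks about, you can loop over column pairs with the same query. A hypothetical helper along these lines could work (the names cols, m, and the r alias are illustrative, not part of the SparkR API; each pair launches a separate Spark job, so this only makes sense for a modest number of columns):

cols <- columns(sdf)
n <- length(cols)
m <- matrix(1, n, n, dimnames = list(cols, cols))
for (i in seq_len(n - 1)) {
  for (j in seq(i + 1, n)) {
    q <- sql(sqlContext, paste0(
      "SELECT corr(", cols[i], ", ", cols[j], ") AS r FROM sdf"))
    r <- head(q)$r         # single-row local result
    m[i, j] <- r
    m[j, i] <- r
  }
}
m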

Spark >= 1.6

You can use the corr function on a DataFrame directly.
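A minimal sketch, assuming the sdf DataFrame from above and the SparkR 1.6 corr signature:

corr(sdf, "sepal_length", "sepal_width", method = "pearson")
## [1] -0.1175698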

Upvotes: 2
