How to manipulate Spark Dataframe in R with sparklyr?

I'm working in a Spark environment, and I'm trying to manipulate some data that comes as a tbl_spark. The problem is that I can't apply any of the usual data-manipulation functions to it.

I've used df <- spark_read_table(sc, "tb_krill_sensordatatable_phoenix") to import it, and that seems successful. However, when I try to pivot it with tidyr::spread(), it says the method is not applicable to tbl_spark objects.

What I'm trying now is df_tbl <- as_tibble(df). However, it has been running for hours with no result.

I don't know if I should have used a different function to import it instead of spark_read_table(), or if I should convert it to another of R's usual data frame formats.


df_phoenix <- spark_read_table(sc,"tb_krill_sensordatatable_phoenix")
class(df_phoenix)
# [1] "tbl_spark" "tbl_sql"   "tbl_lazy"  "tbl"  

base_spread <- df_phoenix %>% 
   spread(key = sensorname, value = sensorvalue)
#Error in UseMethod("spread_") : 
#  no applicable method for 'spread_' applied to an object of class "c('tbl_spark', 'tbl_sql', 'tbl_lazy', 'tbl')"

aux <- as_tibble(df_phoenix)
#this one takes forever and nothing happens

Upvotes: 3

Views: 3455

Answers (1)

kangaroo_cliff

Reputation: 6222

Maybe try

base_spread <- df_phoenix %>% 
                sdf_pivot(sensorvalue ~ sensorname, fun.aggregate = list(Value = "first"))

to get the same functionality as tidyr::spread.

If you want to use the tidyr functions, you have to bring the data into R, which can be done using

df <- df_phoenix %>% collect()
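As a sketch of the full workflow (assuming the table has sensorname and sensorvalue columns, as in the question; the select() step is just a suggestion to reduce what gets pulled into R, since collect() transfers the whole result set to the driver):

```r
library(sparklyr)
library(dplyr)
library(tidyr)

# Reduce the data on the Spark side first, then collect only what is needed.
df_local <- df_phoenix %>%
  select(sensorname, sensorvalue) %>%
  collect()

# In recent tidyr versions, pivot_wider() supersedes spread().
# values_fn = first handles duplicate sensorname entries by keeping the
# first value, mirroring fun.aggregate = "first" above.
base_spread <- df_local %>%
  pivot_wider(names_from = sensorname,
              values_from = sensorvalue,
              values_fn = dplyr::first)
```

Note that collect() can be slow or exhaust memory on large tables, which may explain why as_tibble() appeared to hang; filtering or aggregating in Spark before collecting is usually the practical fix.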

Upvotes: 2
