Reputation: 122
I'm working on a Spark environment, and I'm trying to manipulate some data that comes as tbl_spark. The problem is I can't apply any usual data manipulation functions to it.
I've used df <- spark_read_table(sc, "tb_krill_sensordatatable_phoenix") to import it, and that seems to work. However, when I try to pivot it with tidyr::spread(), it says the method is not applicable to tbl_spark objects.
What I'm trying now is df_tbl <- as_tibble(df). However, it has been running for hours and nothing has happened.
I don't know if I should have used another function to import it, other than spark_read_table(), or if I should convert to another usual dataframe format in R.
df_phoenix <- spark_read_table(sc,"tb_krill_sensordatatable_phoenix")
class(df_phoenix)
# [1] "tbl_spark" "tbl_sql" "tbl_lazy" "tbl"
base_spread <- df_phoenix %>%
spread(key = sensorname, value = sensorvalue)
#Error in UseMethod("spread_") :
# no applicable method for 'spread_' applied to an object of class "c('tbl_spark', 'tbl_sql', 'tbl_lazy', 'tbl')"
aux <- as_tibble(df_phoenix)
#this one takes forever and nothing happens
Upvotes: 3
Views: 3455
Reputation: 6222
Maybe try
base_spread <- df_phoenix %>%
sdf_pivot(sensorvalue ~ sensorname, fun.aggregate = list(Value = "first"))
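to get the same functionality as tidyr::spread(). Note that in sdf_pivot() the left-hand side of the formula names the column(s) to group by and the right-hand side names the column whose values become the new columns, and the names in the fun.aggregate list refer to the column being aggregated, so adjust both to your table's actual columns.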
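For reference, here is a minimal sketch of how sdf_pivot() behaves on toy data. It assumes a local Spark installation, and the ts column is a hypothetical row identifier (the question doesn't show which other columns the table has):

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally; in the question `sc` already exists.
sc <- spark_connect(master = "local")

# Toy long-format data; `ts` is a hypothetical identifier for each reading.
toy <- data.frame(
  ts          = c(1, 1, 2, 2),
  sensorname  = c("temp", "hum", "temp", "hum"),
  sensorvalue = c(20.5, 40, 21, 42),
  stringsAsFactors = FALSE
)
toy_tbl <- copy_to(sc, toy, "toy_sensors", overwrite = TRUE)

# Group by `ts`, turn each distinct sensorname into its own column,
# and fill it with the first sensorvalue found in each group.
wide_tbl <- sdf_pivot(
  toy_tbl,
  ts ~ sensorname,
  fun.aggregate = list(sensorvalue = "first")
)

# The pivot runs in Spark; only the (small) result is brought into R.
wide_local <- collect(wide_tbl)

spark_disconnect(sc)
```

The result should have one row per ts and one column per sensor name.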
You have to get the data into R if you want to use the tidyr functions, which can be done using
df <- df_phoenix %>% collect()
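Keep in mind that collect() pulls the whole result into local R memory, so on a large table it can be just as slow as the as_tibble() call in the question. A rough sketch of shrinking the data in Spark first and only then collecting, again on toy data with a local Spark install and a hypothetical ts column as assumptions:

```r
library(sparklyr)
library(dplyr)
library(tidyr)

# Assumption: Spark is installed locally; in the question `sc` already exists.
sc <- spark_connect(master = "local")

# Toy long-format data; `ts` is a hypothetical identifier for each reading.
toy <- data.frame(
  ts          = c(1, 1, 2, 2, 3, 3),
  sensorname  = c("temp", "hum", "temp", "hum", "temp", "hum"),
  sensorvalue = c(20.5, 40, 21, 42, 22, 41),
  stringsAsFactors = FALSE
)
toy_tbl <- copy_to(sc, toy, "toy_sensors_small", overwrite = TRUE)

# filter() and select() are translated to Spark SQL and run on the cluster,
# so only the rows and columns that survive are transferred by collect().
local_long <- toy_tbl %>%
  filter(ts >= 2) %>%
  select(ts, sensorname, sensorvalue) %>%
  collect()

# Now a regular tibble, so the tidyr verbs work as usual.
local_wide <- local_long %>%
  spread(key = sensorname, value = sensorvalue)

spark_disconnect(sc)
```

Whether this helps depends on how much of the table you actually need; if you need all of it in wide form, pivoting inside Spark is likely the better route.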
Upvotes: 2