Reputation: 405
I am just trying to find the maximum value of a date column in data imported using sparklyr in Databricks:
data %>% dplyr::select(date_variable) %>% max()
This returns an error:
Error in max(.) : invalid 'type' (list) of argument
If I just do max(data$date_variable), it gives -Inf. Surely max() and min() have to be among the most used functions out there? What am I doing wrong here?
Upvotes: 1
Views: 1057
Reputation: 9107
library(sparklyr)
sc <- spark_connect(method = "databricks")
data <- sdf_sql(sc, "SELECT DATE('2020-01-01') AS date_variable UNION SELECT DATE('2020-01-02') AS date_variable")
data is a Spark data frame. The result of select() is still a Spark data frame, which can't be used with base R's max().
data %>% dplyr::select(date_variable) %>% class()
#> [1] "tbl_spark" "tbl_sql" "tbl_lazy" "tbl"
There are two main ways to get the max value.
If you use pull(), the result is a Date vector, which works with max().
data %>% dplyr::pull(date_variable) %>% max()
#> [1] "2020-01-02"
The downside of this method is that the computation isn't done in Spark. To do the computation in Spark, use summarise().
data %>% dplyr::summarise(max_date = max(date_variable))
#> # Source: spark<?> [?? x 1]
#> max_date
#> <date>
#> 1 2020-01-02
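If you want the aggregation done in Spark but a plain Date value back in R, the two approaches can be combined (a sketch; adding na.rm = TRUE is optional but keeps dbplyr from warning that SQL aggregates always drop missing values):

```r
# Aggregate in Spark, then collect only the single resulting value into R:
data %>%
  dplyr::summarise(max_date = max(date_variable, na.rm = TRUE)) %>%
  dplyr::pull(max_date)
#> [1] "2020-01-02"
```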
Upvotes: 1