Kenneth Singh

Finding the maximum value of a date column using sparklyr in Databricks

I am just trying to find the maximum value of a date column in data imported using sparklyr in Databricks:

data %>% dplyr::select(date_variable) %>% max()

This returns an error:

Error in max(.) : invalid 'type' (list) of argument

If I just do max(data$date_variable), it gives -Inf. Surely max() and min() have to be among the most used functions out there? What am I doing wrong here?

Answers (1)

Paul

library(sparklyr)

# Connect to the Databricks-managed Spark cluster
sc <- spark_connect(method = "databricks")

# Example Spark data frame with a date column
data <- sdf_sql(sc, "SELECT DATE('2020-01-01') AS date_variable UNION SELECT DATE('2020-01-02') AS date_variable")

data is a Spark data frame. The result of select is still a Spark data frame, which can't be used with base R's max().

data %>% dplyr::select(date_variable) %>% class()
#> [1] "tbl_spark" "tbl_sql"   "tbl_lazy"  "tbl"

There are two main ways to get the max value.

If you use pull, the column is collected into R as a Date vector, which can be used with max.

data %>% dplyr::pull(date_variable) %>% max()
#> [1] "2020-01-02"

The downside of this method is that the whole column is collected into the R session, so the computation isn't done in Spark. To do the computation in Spark, use summarise.

data %>% dplyr::summarise(max_date = max(date_variable))
#> # Source: spark<?> [?? x 1]
#>   max_date  
#>   <date>    
#> 1 2020-01-02
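
If you also need the aggregated value back in R as a plain Date (for example to reuse it later in your code), one option is to do the aggregation in Spark with summarise and then pull the single-row result; a minimal sketch using the same data frame as above:

# Spark computes the maximum; only the one-row result is collected into R
max_date <- data %>%
  dplyr::summarise(max_date = max(date_variable)) %>%
  dplyr::pull(max_date)

max_date
#> [1] "2020-01-02"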
