Reputation: 2237
I am trying to run sql
using r
dbplyr
and have come across a bug/wrong results when using last()
function with dbplyr
code. I am not sure if I am doing it wrong or if there is actually a bug.
Objective: To get last Status of each employee from the table using dbplyr.
I have tried using below code to replicate the issue.
1. Libs
library(tidyverse)
library(lubridate)
library(dbplyr)
library(RSQLite)
2. Data
df_emp_status <- data.frame(
Emp_id = c(121,321,451,121,451,451,321,755),
TimeStamp = c('29-07-2019 08:55:55','29-07-2019 09:02:55','29-07-2019 09:05:50',
'29-07-2019 10:05:50','29-07-2019 10:07:50','29-07-2019 10:10:10',
'29-07-2019 10:20:10','29-07-2019 11:00:00'),
Status = c('IN','IN','IN','OUT','OUT','IN','OUT','IN')
)
3. DB connection & Data
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
df_emp_status_sql <- copy_to(con, df_emp_status, overwrite = TRUE)
4. Objective: To get last Status
of each employee
4.1 R Code
using summarise()
df_emp_status %>%
group_by(Emp_id) %>%
arrange(Emp_id,TimeStamp) %>%
summarise(last_status = last(Status))
# A tibble: 4 × 2
Emp_id last_status
<dbl> <chr>
1 121 OUT
2 321 OUT
3 451 IN
4 755 IN
using mutate
df_emp_status %>%
group_by(Emp_id) %>%
arrange(Emp_id,TimeStamp) %>%
mutate(last_status = last(Status))
# A tibble: 8 × 4
# Groups: Emp_id [4]
Emp_id TimeStamp Status last_status
<dbl> <dttm> <chr> <chr>
1 121 2019-07-29 08:55:55 IN OUT
2 121 2019-07-29 10:05:50 OUT OUT
3 321 2019-07-29 09:02:55 IN OUT
4 321 2019-07-29 10:20:10 OUT OUT
5 451 2019-07-29 09:05:50 IN IN
6 451 2019-07-29 10:07:50 OUT IN
7 451 2019-07-29 10:10:10 IN IN
8 755 2019-07-29 11:00:00 IN IN
Above Results are correct & giving me the expected results:
4.2 SQL / DBPLYR replication (gives wrong results)
since last()
doesn't work with summarise
when using dbplyr
so gonna use mutate()
instead
df_emp_status_sql %>%
group_by(Emp_id) %>%
arrange(Emp_id,TimeStamp) %>%
mutate(last_status = last(Status))
# Source: SQL [8 x 4]
# Database: sqlite 3.39.4 [:memory:]
# Groups: Emp_id
# Ordered by: Emp_id, TimeStamp
Emp_id TimeStamp Status last_status
<dbl> <dbl> <chr> <chr>
1 121 1564390555 IN IN
2 121 1564394750 OUT OUT
3 321 1564390975 IN IN
4 321 1564395610 OUT OUT
5 451 1564391150 IN IN
6 451 1564394870 OUT OUT
7 451 1564395010 IN IN
8 755 1564398000 IN IN
Above results are wrong
4.3 Query Crosscheck
df_emp_status_sql %>%
group_by(Emp_id) %>%
arrange(Emp_id,TimeStamp) %>%
mutate(last_status = last(Status)) %>%
show_query()
<SQL>
SELECT
*,
LAST_VALUE(`Status`) OVER (PARTITION BY `Emp_id` ORDER BY `Emp_id`, `TimeStamp`) AS `last_status`
FROM `df_emp_status`
ORDER BY `Emp_id`, `TimeStamp`
I have tried to run above query separately and this also throws Wrong Results.
4.4 Correct Query I was able to find Correct Query that worked for desired results from How to use last_value() in sql:
dbGetQuery(con,'
SELECT
*,
LAST_VALUE(`Status`) OVER (
PARTITION BY `Emp_id`
ORDER BY `Emp_id`, `TimeStamp`
RANGE BETWEEN
UNBOUNDED PRECEDING AND
UNBOUNDED FOLLOWING) AS `last_status`
FROM `df_emp_status`
ORDER BY `Emp_id`, `TimeStamp`
')
Emp_id TimeStamp Status last_status
1 121 1564390555 IN OUT
2 121 1564394750 OUT OUT
3 321 1564390975 IN OUT
4 321 1564395610 OUT OUT
5 451 1564391150 IN IN
6 451 1564394870 OUT IN
7 451 1564395010 IN IN
8 755 1564398000 IN IN
5. Conclusion:
RANGE BETWEEN
UNBOUNDED PRECEDING AND
UNBOUNDED FOLLOWING
As above code seems to be important to get the right results in sql query
with last_value
so the dbplyr translation should also include above code in their r to sql translation
otherwise it will give wrong results.
Upvotes: 3
Views: 94
Reputation: 6931
I can not recall where, but I can recall seeing something about default settings for PRECEDING
and FOLLOWING
. It seems likely that these are determined implicitly when working with dbplyr and hence you may not be able to set them directly.
An alternative approach that will work for this application is lead
. Try something like the following:
df_emp_status_sql %>%
group_by(Emp_id) %>%
arrange(TimeStamp) %>%
mutate(next_status = lead(Status)) %>%
filter(is.na(next_status)) %>%
select(-next_status)
The idea is to use the lead
function to add the next status to each record. The very last record for each id
will have a missing (NA
or NULL
depending on choice of language) value, so you ca filter on this value.
Upvotes: 0