Reputation: 151
I am trying to use dplyr::lag to determine the number of days that have passed for each event since the initial event, but I am getting unexpected behavior.
Example, very simple data:
df <- data.frame(id = c("1", "1", "1", "1", "2", "2"),
date= c("4/1/2020", "4/2/2020", "4/3/2020", "4/4/2020", "4/17/2020", "4/18/2020"))
df$date <- as.Date(df$date, format = "%m/%d/%Y")
id date
1 1 4/1/2020
2 1 4/2/2020
3 1 4/3/2020
4 1 4/4/2020
5 2 4/17/2020
6 2 4/18/2020
What I was hoping to do was create a new column, days_since_first_event, holding the number of days between the initial event for each id and each subsequent date. Here is what I tried, followed by the output I expected:
library(dplyr)
df <- df %>%
  group_by(id) %>%
  mutate(days_since_first_event = as.numeric(date - lag(date)))
id date days_since_first_event
1 1 4/1/2020 0
2 1 4/2/2020 1
3 1 4/3/2020 2
4 1 4/4/2020 3
5 2 4/17/2020 0
6 2 4/18/2020 1
But instead I get this output
# A tibble: 6 x 3
# Groups: id [2]
id date days_since_first_event
<chr> <date> <dbl>
1 1 2020-04-01 NA
2 1 2020-04-02 1
3 1 2020-04-03 1
4 1 2020-04-04 1
5 2 2020-04-17 NA
6 2 2020-04-18 1
Any suggestions on what I'm doing wrong?
Upvotes: 0
Views: 823
Reputation: 18487
The first n values of lag() get a default value, because you don't have 'older' data. The default value is NA. Hence the NA in your results.
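To see this in isolation, here is a small sketch on a made-up vector (the values 10, 20, 30 are only for illustration and are not from the question). It shows the leading NA, and how the default argument of dplyr::lag replaces it:

```r
library(dplyr)

x <- c(10, 20, 30)  # illustrative values, not from the question

# Shift by one position: there is no earlier value for the first element,
# so it is filled with NA.
shifted <- dplyr::lag(x)

# Same shift, but fill the gap with 0 instead of NA.
shifted_zero <- dplyr::lag(x, default = 0)

print(shifted)
print(shifted_zero)
```

Inside group_by(id), the same thing happens once per group, which is why the first row of each id is NA.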
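Furthermore, using lag will only yield the difference between consecutive events, which is why every non-NA value in your output is 1. To count days since the first event, subtract the earliest date within each id instead of the previous row's date, for example date - min(date) (or date - first(date) if the rows are already sorted by date within each id).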
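A minimal sketch of one way to get the expected output, using the data from the question and subtracting each group's earliest date rather than the previous row's date. The name result is just for illustration, and min(date) is my choice here because it does not depend on the row order:

```r
library(dplyr)

df <- data.frame(
  id   = c("1", "1", "1", "1", "2", "2"),
  date = as.Date(c("4/1/2020", "4/2/2020", "4/3/2020", "4/4/2020",
                   "4/17/2020", "4/18/2020"), format = "%m/%d/%Y")
)

result <- df %>%
  group_by(id) %>%
  # min(date) is the earliest date within each id, so every row is
  # compared against the first event of its own group.
  mutate(days_since_first_event = as.numeric(date - min(date))) %>%
  ungroup()

print(result)
```

With this data, days_since_first_event should be 0, 1, 2, 3 for id 1 and 0, 1 for id 2.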
Upvotes: 1