nlp

Reputation: 151

Using dplyr::lag to calculate days since first event

I am trying to use dplyr::lag to determine the number of days that have passed between the initial event and each later event, but I am getting unexpected behavior.

Example, very simple data:

df <- data.frame(id = c("1", "1", "1", "1", "2", "2"),
                 date= c("4/1/2020", "4/2/2020", "4/3/2020", "4/4/2020", "4/17/2020", "4/18/2020"))

df$date <- as.Date(df$date, format = "%m/%d/%Y")

id      date
1  1  4/1/2020
2  1  4/2/2020
3  1  4/3/2020
4  1  4/4/2020
5  2 4/17/2020
6  2 4/18/2020

I want to create a new column, days_since_first_event, that holds the number of days between the initial event for each id and each subsequent date. This is the code I tried, followed by the output I expected:

library(dplyr)

df <- df %>%
  group_by(id) %>%
  mutate(days_since_first_event = as.numeric(date - lag(date)))

id      date days_since_first_event
1  1  4/1/2020                      0
2  1  4/2/2020                      1
3  1  4/3/2020                      2
4  1  4/4/2020                      3
5  2 4/17/2020                      0
6  2 4/18/2020                      1

But instead I get this output

# A tibble: 6 x 3
# Groups:   id [2]
  id    date       days_since_first_event
  <chr> <date>                      <dbl>
1 1     2020-04-01                     NA
2 1     2020-04-02                      1
3 1     2020-04-03                      1
4 1     2020-04-04                      1
5 2     2020-04-17                     NA
6 2     2020-04-18                      1

Any suggestions on what I'm doing wrong?

Upvotes: 0

Views: 823

Answers (1)

Thierry

Reputation: 18487

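lag(x, n) shifts the values by n positions (n = 1 unless you specify otherwise), so the first n values in each group have no 'older' data to draw from. Those positions are filled with the default value, which is NA. Hence the NA in the first row of each id in your results.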
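A small sketch of that behaviour may help. The vector `x` and the result names below are made up for illustration; only `lag()` and its `n` and `default` arguments come from dplyr.

```r
library(dplyr)

# Hypothetical toy vector, not taken from the question
x <- c(10, 20, 30)

# Default n = 1: the first position has no earlier value, so it becomes NA
lag_one <- lag(x)               # NA 10 20

# n = 2: the first two positions have no earlier value
lag_two <- lag(x, n = 2)        # NA NA 10

# A different fill value can be supplied through `default`
lag_zero <- lag(x, default = 0) # 0 10 20
```

Changing `default` removes the NA, but it does not change what lag() compares: each row is still paired with the row directly before it.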

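Furthermore, lag() only yields the difference between consecutive events, not the time elapsed since the first one. That is why every non-NA value in your output is 1. To measure from the initial event, subtract the earliest date within each id from every date in that id.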
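One way to do this, as a minimal sketch rather than the only approach, is to subtract `min(date)` inside the grouped `mutate()`. The data frame is the one from the question; the name `result` is just an illustrative choice. Using `min(date)` assumes the "first event" means the earliest date per id; if the rows are already sorted by date, `first(date)` would give the same answer.

```r
library(dplyr)

# Data from the question
df <- data.frame(
  id   = c("1", "1", "1", "1", "2", "2"),
  date = c("4/1/2020", "4/2/2020", "4/3/2020", "4/4/2020", "4/17/2020", "4/18/2020")
)
df$date <- as.Date(df$date, format = "%m/%d/%Y")

result <- df %>%
  group_by(id) %>%
  # Subtract the earliest date in each id; Date - Date is a difftime in days
  mutate(days_since_first_event = as.numeric(date - min(date))) %>%
  ungroup()

result$days_since_first_event
# 0 1 2 3 0 1
```

Because the subtraction is against a single reference date per group, no row is left without a partner, so the first row of each id comes out as 0 instead of NA.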

Upvotes: 1
