R average date difference within two groups

Question

I'm working with some data with id, month, date. I'd like to have an average difference in id and for each month (so two groups). I've read this post, and I've tried to modify the answer (it is made only for ID, not month), without luck.

My data are something like:

test <-structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                       1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                       1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                       1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                       1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                       1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                       1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "1", class = "factor"), 
                       month = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
                                1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 
                                3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
                                3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 
                                4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 
                                4, 4, 4, 4, 4, 4), 
                       date = structure(c(17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 
                                                                     17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 
                                                                     17555, 17579, 17579, 17579, 17579, 17579, 17579, 17579, 17579, 
                                                                     17579, 17579, 17579, 17579, 17618, 17618, 17618, 17618, 17618, 
                                                                     17618, 17618, 17618, 17618, 17618, 17618, 17621, 17621, 17621, 
                                                                      17621, 17621, 17621, 17621, 17621, 17621, 17621, 17621, 17649, 
                                                                      17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 
                                                                      17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 
                                                                      17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 
                                                                      17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 
                                                                      17649, 17649, 17649, 17649, 17649), class = "Date")),class="data.frame",row.names = c(NA,-98L))

And the result is something like(sorry for the dput(), but it was the less painful way to share the example of the data).

head(test)
id month       date
1  1     1 2018-01-24
2  1     1 2018-01-24
3  1     1 2018-01-24
4  1     1 2018-01-24
5  1     1 2018-01-24
6  1     1 2018-01-24

So I've tried this:

library(dplyr)
test %>%
group_by(id,month)%>%
arrange(date) %>%
summarize(avg = as.numeric(mean(diff(date))))%>%data.frame()

And the result is:

> result
  id month       avg
1  1     1 0.0000000
2  1     2 0.0000000
3  1     3 0.1428571
4  1     4 0.0000000

But, looking at the data, March has a problem, because the days of March are the 31 and 28, their difference is 3, and the average of that difference should be 3 (only one distance).

> table(test[which(test$month==3),]$date)

2018-03-28 2018-03-31 
        11         11

What Am I doing wrong?
Thanks in advance

Pierre Gramme · Accepted Answer

The result you obtained is correct: diff(date) computes the difference between all pairs of consecutive dates in the data (within the groups and after you have sorted the dates). In March, you have 11 times 2018-03-28 and 11 times 2018-03-31. So in March, diff(date) is 10 times 0, one time 3, and 10 times 0. Thus the average is 3/21=0.143.

Maybe you first want to consider distinct combinations of (id, month, date):

test %>%
  distinct(id, month, date) %>%
  group_by(id,month)%>%
  arrange(date) %>%
  summarize(avg = as.numeric(mean(diff(date)))) %>%
  data.frame()

Note that this outputs 3 for March, but NaN for the other months since you are asking for the diff on a vector of length 1, which gives a vector of length 0. As an alternative, you can use

test %>%
  distinct(id, month, date) %>%
  group_by(id,month)%>%
  arrange(date) %>%
  summarize(avg = as.numeric(max(date)-min(date)) / max(1, n()-1))

R average date difference within two groups

Answers (1)

Related Questions