Reputation: 9485
I'm working with some data with id, month, date. I'd like to have an average difference in id and for each month (so two groups). I've read this post, and I've tried to modify the answer (it is made only for ID, not month), without luck.
My data are something like:
test <-structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "1", class = "factor"),
month = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4),
date = structure(c(17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555,
17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555,
17555, 17579, 17579, 17579, 17579, 17579, 17579, 17579, 17579,
17579, 17579, 17579, 17579, 17618, 17618, 17618, 17618, 17618,
17618, 17618, 17618, 17618, 17618, 17618, 17621, 17621, 17621,
17621, 17621, 17621, 17621, 17621, 17621, 17621, 17621, 17649,
17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649,
17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649,
17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649,
17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649,
17649, 17649, 17649, 17649, 17649), class = "Date")),class="data.frame",row.names = c(NA,-98L))
And the result is something like(sorry for the dput()
, but it was the less painful way to share the example of the data).
head(test)
id month date
1 1 1 2018-01-24
2 1 1 2018-01-24
3 1 1 2018-01-24
4 1 1 2018-01-24
5 1 1 2018-01-24
6 1 1 2018-01-24
So I've tried this:
library(dplyr)
test %>%
group_by(id,month)%>%
arrange(date) %>%
summarize(avg = as.numeric(mean(diff(date))))%>%data.frame()
And the result is:
> result
id month avg
1 1 1 0.0000000
2 1 2 0.0000000
3 1 3 0.1428571
4 1 4 0.0000000
But, looking at the data, March has a problem, because the days of March are the 31 and 28, their difference is 3, and the average of that difference should be 3 (only one distance).
> table(test[which(test$month==3),]$date)
2018-03-28 2018-03-31
11 11
What Am I doing wrong?
Thanks in advance
Upvotes: 1
Views: 71
Reputation: 1254
The result you obtained is correct: diff(date)
computes the difference between all pairs of consecutive dates in the data (within the groups and after you have sorted the dates). In March, you have 11 times 2018-03-28
and 11 times 2018-03-31
. So in March, diff(date)
is 10 times 0, one time 3, and 10 times 0. Thus the average is 3/21=0.143.
Maybe you first want to consider distinct combinations of (id, month, date)
:
test %>%
distinct(id, month, date) %>%
group_by(id,month)%>%
arrange(date) %>%
summarize(avg = as.numeric(mean(diff(date)))) %>%
data.frame()
Note that this outputs 3 for March, but NaN
for the other months since you are asking for the diff
on a vector of length 1, which gives a vector of length 0. As an alternative, you can use
test %>%
distinct(id, month, date) %>%
group_by(id,month)%>%
arrange(date) %>%
summarize(avg = as.numeric(max(date)-min(date)) / max(1, n()-1))
Upvotes: 3