Reputation: 144
In R, I need to count, by groups, how many consecutive months exist, leading up to the month of the row. This is a running count that should restart once there was a missing month. Here is an example, with the desired outcome in the result column.
date <- c("2020-01-01", "2020-02-01", "2020-03-01", "2020-05-01", "2020-06-01", "2020-01-01", "2020-03-01", "2020-04-01")
group <- c("a","a","a","a","a","b","b","b")
result <- c(1,2,3,1,2,1,1,2)
data.frame(date=as.Date(date), group=group, result=result)
For group "a," the count broke and restarted from May because April doesn't exist for "a." Same for "b," Feb didn't exist so count restarted from March. How can I get the result column?
Upvotes: 2
Views: 567
Reputation: 269491
Convert the date to yearmon class from zoo and multiply it by 12 so that successive months are successive integers internally. Then use seqid from collapse to generate a grouping variable and within groups use 1:n() to sequence them.
library(collapse)
library(dplyr)
library(zoo)
DF %>%
group_by(group, result = seqid(12 * as.yearmon(date))) %>%
mutate(result = 1:n()) %>%
ungroup
This could be done without dplyr like this:
nr <- nrow(DF)
tfm(DF, result = ave(1:nr, group, seqid(12*as.yearmon(date)), FUN = seq_along))
Upvotes: 1
Reputation: 388947
Here's a dplyr
approach :
library(dplyr)
df %>%
mutate(date = as.Date(date)) %>%
group_by(group, grp = cumsum(date - lag(date, default = as.Date('1900-01-01')) > 35)) %>%
mutate(result2 = row_number()) %>%
ungroup %>%
select(-grp)
# date group result result2
# <date> <chr> <dbl> <int>
#1 2020-01-01 a 1 1
#2 2020-02-01 a 2 2
#3 2020-03-01 a 3 3
#4 2020-05-01 a 1 1
#5 2020-06-01 a 2 2
#6 2020-01-01 b 1 1
#7 2020-03-01 b 1 1
#8 2020-04-01 b 2 2
I have used the default date in lag
as as.Date('1900-01-01')
assuming all your dates are greater than that date. I am checking if the number of days between current date and previous date is greater than 35 which is an arbitrary number just to make sure the difference is more than a month.
Upvotes: 2
Reputation: 21908
I've been meaning to come up with a tidyverse solution, but I could not solve the counter index and in the I resorted to write a custom function. This is not anywhere as efficient as the solution proposed by Mr. @ThomasIsCoding but I thought it would be a decent addition to the answers:
library(dplyr)
# First I write a custom function that does the counting based on your desired condition
set_index <- function(x) {
result <- vector("numeric", length(x))
result[[1]] <- 1
for(i in 2:length(x)) {
if(x[[i]] - x[[i-1]] == 1) {
result[[i]] <- result[[i-1]] + 1
} else {
result[[i]] <- 1
}
}
result
}
# Then I applied it to the month column I extracted out of dates
df %>%
mutate(month = month(date),
result = set_index(month)) %>%
select(-month)
date group result
1 2020-01-01 a 1
2 2020-02-01 a 2
3 2020-03-01 a 3
4 2020-05-01 a 1
5 2020-06-01 a 2
6 2020-01-01 b 1
7 2020-03-01 b 1
8 2020-04-01 b 2
Upvotes: 2
Reputation: 101247
Here is a data.table
option using rowid
+ cumsum
setDT(df)[, result := rowid(cumsum(c(TRUE, round(diff(date) / 30.42) != 1))), group]
which gives
date group result
1: 2020-01-01 a 1
2: 2020-02-01 a 2
3: 2020-03-01 a 3
4: 2020-05-01 a 1
5: 2020-06-01 a 2
6: 2020-01-01 b 1
7: 2020-03-01 b 1
8: 2020-04-01 b 2
Data
> dput(df)
structure(list(date = structure(c(18262, 18293, 18322, 18383,
18414, 18262, 18322, 18353), class = "Date"), group = c("a",
"a", "a", "a", "a", "b", "b", "b")), class = "data.frame", row.names = c(NA,
-8L))
Upvotes: 1