DJL
DJL

Reputation: 144

Count consecutive month present in the data frame, by group

In R, I need to count, by groups, how many consecutive months exist, leading up to the month of the row. This is a running count that should restart once there was a missing month. Here is an example, with the desired outcome in the result column.

date <- c("2020-01-01", "2020-02-01", "2020-03-01", "2020-05-01", "2020-06-01", "2020-01-01", "2020-03-01", "2020-04-01")
group <- c("a","a","a","a","a","b","b","b")
result <- c(1,2,3,1,2,1,1,2)

data.frame(date=as.Date(date), group=group, result=result)

For group "a," the count broke and restarted from May because April doesn't exist for "a." Same for "b," Feb didn't exist so count restarted from March. How can I get the result column?

Upvotes: 2

Views: 567

Answers (4)

G. Grothendieck
G. Grothendieck

Reputation: 269491

Convert the date to yearmon class from zoo and multiply it by 12 so that successive months are successive integers internally. Then use seqid from collapse to generate a grouping variable and within groups use 1:n() to sequence them.

library(collapse)
library(dplyr)
library(zoo)

DF %>% 
  group_by(group, result = seqid(12 * as.yearmon(date))) %>%
  mutate(result = 1:n()) %>%
  ungroup

This could be done without dplyr like this:

nr <- nrow(DF)
tfm(DF, result = ave(1:nr, group, seqid(12*as.yearmon(date)), FUN = seq_along))

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 388947

Here's a dplyr approach :

library(dplyr)

df %>%
  mutate(date = as.Date(date)) %>%
  group_by(group, grp = cumsum(date - lag(date, default = as.Date('1900-01-01')) > 35)) %>%
  mutate(result2 = row_number()) %>%
  ungroup %>%
  select(-grp)

#   date       group result result2
#  <date>     <chr>  <dbl>   <int>
#1 2020-01-01 a          1       1
#2 2020-02-01 a          2       2
#3 2020-03-01 a          3       3
#4 2020-05-01 a          1       1
#5 2020-06-01 a          2       2
#6 2020-01-01 b          1       1
#7 2020-03-01 b          1       1
#8 2020-04-01 b          2       2

I have used the default date in lag as as.Date('1900-01-01') assuming all your dates are greater than that date. I am checking if the number of days between current date and previous date is greater than 35 which is an arbitrary number just to make sure the difference is more than a month.

Upvotes: 2

Anoushiravan R
Anoushiravan R

Reputation: 21908

I've been meaning to come up with a tidyverse solution, but I could not solve the counter index and in the I resorted to write a custom function. This is not anywhere as efficient as the solution proposed by Mr. @ThomasIsCoding but I thought it would be a decent addition to the answers:

library(dplyr)

# First I write a custom function that does the counting based on your desired condition

set_index <- function(x) {
  result <- vector("numeric", length(x))
  result[[1]] <- 1

  for(i in 2:length(x)) {
    if(x[[i]] - x[[i-1]] == 1) {
      result[[i]] <- result[[i-1]] + 1
    } else {
      result[[i]] <- 1
    }
  }
  result
}

# Then I applied it to the month column I extracted out of dates

df %>% 
  mutate(month = month(date), 
         result = set_index(month)) %>%
  select(-month)

        date group result
1 2020-01-01     a      1
2 2020-02-01     a      2
3 2020-03-01     a      3
4 2020-05-01     a      1
5 2020-06-01     a      2
6 2020-01-01     b      1
7 2020-03-01     b      1
8 2020-04-01     b      2

Upvotes: 2

ThomasIsCoding
ThomasIsCoding

Reputation: 101247

Here is a data.table option using rowid + cumsum

setDT(df)[, result := rowid(cumsum(c(TRUE, round(diff(date) / 30.42) != 1))), group]

which gives

         date group result
1: 2020-01-01     a      1
2: 2020-02-01     a      2
3: 2020-03-01     a      3
4: 2020-05-01     a      1
5: 2020-06-01     a      2
6: 2020-01-01     b      1
7: 2020-03-01     b      1
8: 2020-04-01     b      2

Data

> dput(df)
structure(list(date = structure(c(18262, 18293, 18322, 18383,
18414, 18262, 18322, 18353), class = "Date"), group = c("a",
"a", "a", "a", "a", "b", "b", "b")), class = "data.frame", row.names = c(NA,
-8L))

Upvotes: 1

Related Questions