create a cumulative count of adjacent repetitions in a column

Question

I want to create a cumulative count of adjacent repetitions in a column. For instance the desired output for repeat_n in the code below should be c(1,2,3,1,2,1,2), but instead I am getting c(1,2,2,1,2,1,2). Perhaps because case_when() is vectorized, case_when() evaluates all values simultaneously rather sequentially evaluating the updated values. How should I avoid this problem?

library(dplyr)

tibble(x = c(1,1,1,0,0,1,1)) %>% 
  mutate(
    repeat_n = 1, 
    repeat_n = 
      case_when(
        x == lag(x) ~ lag(repeat_n) + 1,
        TRUE ~ repeat_n
      )
  )

Ronak Shah · Accepted Answer

You can use rleid in data.table to create groups and then create a column with row number for each group.

library(data.table)
setDT(df)[, repeat_n := seq_len(.N), rleid(x)]
df
#   x repeat_n
#1: 1        1
#2: 1        2
#3: 1        3
#4: 0        1
#5: 0        2
#6: 1        1
#7: 1        2

For the same purpose, we can use lag in dplyr to create groups.

df %>%
  group_by(gr = cumsum(x != lag(x, default = first(x)))) %>%
  mutate(repeat_n = row_number())

Another alternative in data.table as suggested by @chinsoon12

setDT(df)[, repeat_n := rowid(rleid(x))]

This does not require formation of any groups and should be faster.

In base R, we can use rle with sequence

with(rle(df$x), sequence(lengths))
#[1] 1 2 3 1 2 1 2

data

df <- tibble(x = c(1,1,1,0,0,1,1))

create a cumulative count of adjacent repetitions in a column

Answers (1)

Related Questions