Reputation: 3796
I want to create a cumulative count of adjacent repetitions in a column. For instance the desired output for repeat_n
in the code below should be c(1,2,3,1,2,1,2), but instead I am getting c(1,2,2,1,2,1,2). Perhaps because case_when()
is vectorized, case_when()
evaluates all values simultaneously rather sequentially evaluating the updated values. How should I avoid this problem?
library(dplyr)
tibble(x = c(1,1,1,0,0,1,1)) %>%
mutate(
repeat_n = 1,
repeat_n =
case_when(
x == lag(x) ~ lag(repeat_n) + 1,
TRUE ~ repeat_n
)
)
Upvotes: 1
Views: 58
Reputation: 388982
You can use rleid
in data.table
to create groups and then create a column with row number for each group.
library(data.table)
setDT(df)[, repeat_n := seq_len(.N), rleid(x)]
df
# x repeat_n
#1: 1 1
#2: 1 2
#3: 1 3
#4: 0 1
#5: 0 2
#6: 1 1
#7: 1 2
For the same purpose, we can use lag
in dplyr
to create groups.
df %>%
group_by(gr = cumsum(x != lag(x, default = first(x)))) %>%
mutate(repeat_n = row_number())
Another alternative in data.table
as suggested by @chinsoon12
setDT(df)[, repeat_n := rowid(rleid(x))]
This does not require formation of any groups and should be faster.
In base R, we can use rle
with sequence
with(rle(df$x), sequence(lengths))
#[1] 1 2 3 1 2 1 2
data
df <- tibble(x = c(1,1,1,0,0,1,1))
Upvotes: 3