Reputation: 5951
Let's say I have the following data frame:
set.seed(123)
df <- data.frame(var1=(runif(10)>0.5)*1)
var1 could have any type and any number of levels, not specifically 0s and 1s. I would like to create a var2 which increments by 1 every time var1 changes, without using a for loop.
Expected result in this case is:
# reuse df$var1 rather than calling runif() again, which would draw new random values
data.frame(var1=df$var1, var2=c(1, 2, 3, 4, 4, 5, 6, 6, 6, 7))
var1 var2
0 1
1 2
0 3
1 4
1 4
0 5
1 6
1 6
1 6
0 7
Another option for the data frame could be:
df <- data.frame(var1=c("a", "a", "1", "0", "b", "b", "b", "c", "1", "1"))
in this case the result should be:
var1 var2
a 1
a 1
1 2
0 3
b 4
b 4
b 4
c 5
1 6
1 6
Upvotes: 15
Views: 13649
Reputation: 206187
As of dplyr 1.1.0, there is a consecutive_id() function you can use. It increments each time a value changes. For example:
library(dplyr)
df %>% mutate(var2=consecutive_id(var1))
# var1 var2
# 1 0 1
# 2 1 2
# 3 0 3
# 4 1 4
# 5 1 4
# 6 0 5
# 7 1 6
# 8 1 6
# 9 1 6
# 10 0 7
Upvotes: 1
Reputation: 55
Using dplyr::lag
library(dplyr)
df <- df %>% mutate(var2 = cumsum(row_number() == 1 | (var1 != dplyr::lag(var1))))
Upvotes: 0
Reputation: 3329
I am only copying Martin Morgan's rle() answer, but implementing it using tidyverse conventions in order to add the grouping column directly to a data frame/tibble, which is how I end up using it most of the time.
## Using run-length-encoding, create groups of identical values and put that
## common grouping identifier into a `grp` column.
library(tidyverse)
set.seed(42)
df <- tibble(x = sample(c(0,1), size=20, replace=TRUE, prob = c(0.2, 0.8)))
df %>%
  mutate(grp = rle(x)$lengths %>% {rep(seq(length(.)), .)})
#> # A tibble: 20 x 2
#> x grp
#> <dbl> <int>
#> 1 0 1
#> 2 0 1
#> 3 1 2
#> 4 0 3
#> 5 1 4
#> 6 1 4
#> 7 1 4
#> 8 1 4
#> 9 1 4
#> 10 1 4
#> 11 1 4
#> 12 1 4
#> 13 0 5
#> 14 1 6
#> 15 1 6
#> 16 0 7
#> 17 0 7
#> 18 1 8
#> 19 1 8
#> 20 1 8
Upvotes: 6
Reputation: 12559
Here is another solution with base R using inverse.rle():
df <- data.frame(var1=c("a", "a", "1", "0", "b", "b", "b", "c", "1", "1"))
r <- rle(as.character(df$var1))
r$values <- seq_along(r$values)
df$var2 <- inverse.rle(r)
Short version:
df$var2 <- with(rle(as.character(df$var1)), rep(seq_along(values), lengths))
Here is a solution with data.table:
library("data.table")
dt <- data.table(var1=c("a", "a", "1", "0", "b", "b", "b", "c", "1", "1"))
dt[, var2:=rleid(var1)]
Upvotes: 5
Reputation: 46856
These look like a run-length encoding (rle)
x = c("a", "a", "1", "0", "b", "b", "b", "c", "1", "1")
r = rle(x)
which gives
> rle(x)
Run Length Encoding
lengths: int [1:6] 2 1 1 3 1 2
values : chr [1:6] "a" "1" "0" "b" "c" "1"
This says that the first value ("a") occurred 2 times in a row, then "1" occurred once, etc. What you're after is to create a sequence along the 'lengths', and replicate each element of that sequence by the number of times the corresponding value occurs, so
> rep(seq_along(r$lengths), r$lengths)
[1] 1 1 2 3 4 4 4 5 6 6
The other answers are semi-deceptive, since they rely on the column being a factor(); they fail when the column is actually a character().
> diff(x)
Error in r[i1] - r[-length(r):-(length(r) - lag + 1L)] :
non-numeric argument to binary operator
A work-around would be to map the characters to integers, along the lines of
> diff(match(x, x))
[1] 0 2 1 1 0 0 3 -5 0
Hmm, but having said that, I find that rle() doesn't work on factors!
> f = factor(x)
> rle(f)
Error in rle(factor(x)) : 'x' must be a vector of an atomic type
> rle(as.vector(f))
Run Length Encoding
lengths: int [1:6] 2 1 1 3 1 2
values : chr [1:6] "a" "1" "0" "b" "c" "1"
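The pieces above can be put together in a small helper. This is only a sketch: run_id() is a hypothetical name (not part of base R), and converting factors with as.character() is one way to get around the rle() error shown above.

```r
# run_id() is a hypothetical helper name used for illustration only.
run_id <- function(x) {
  # rle() rejects factors, so fall back to their character labels
  if (is.factor(x)) x <- as.character(x)
  r <- rle(x)
  # number the runs 1, 2, 3, ... and repeat each number for the length of its run
  rep(seq_along(r$lengths), r$lengths)
}

x <- c("a", "a", "1", "0", "b", "b", "b", "c", "1", "1")
run_id(x)
# [1] 1 1 2 3 4 4 4 5 6 6
run_id(factor(x))
# [1] 1 1 2 3 4 4 4 5 6 6
```

With a data frame this could be used as df$var2 <- run_id(df$var1).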
Upvotes: 13
Reputation: 4534
Building on Mr Flick's answer:
df$var2 <- cumsum(c(0,as.numeric(diff(df$var1))!=0))
But if you don't want to use diff you can still use:
df$var2 <- c(0,cumsum(as.numeric(with(df,var1[1:(length(var1)-1)] != var1[2:length(var1)]))))
It starts at 0, not at 1, but I'm sure you can see how to change that if you want to.
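For what it's worth, here is one possible way to write the comparison-based version so that it starts at 1. It is a sketch, not the only way to do it: head() and tail() stand in for the explicit index ranges, and the variable names changed and var2 are just illustrative. Because it only uses !=, it should also work on character vectors.

```r
var1 <- c("a", "a", "1", "0", "b", "b", "b", "c", "1", "1")

# compare each element with its successor: TRUE wherever the value changes
changed <- head(var1, -1) != tail(var1, -1)

# the first element always opens run 1, so prepend TRUE before accumulating
var2 <- cumsum(c(TRUE, changed))
var2
# [1] 1 1 2 3 4 4 4 5 6 6
```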
Upvotes: 17
Reputation: 206187
How about using diff() and cumsum()? For example:
df$var2 <- cumsum(c(1,diff(df$var1)!=0))
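Note that diff() needs numeric input, so the line above fits the 0/1 example but not a character column. A minimal sketch of one workaround, assuming it is acceptable to map the values to integer codes with match() first (the names num, x and codes are illustrative):

```r
# numeric column: diff() works directly
num <- c(0, 1, 0, 1, 1, 0, 1, 1, 1, 0)
cumsum(c(1, diff(num) != 0))
# [1] 1 2 3 4 4 5 6 6 6 7

# character column: map each distinct value to an integer code first
x <- c("a", "a", "1", "0", "b", "b", "b", "c", "1", "1")
codes <- match(x, unique(x))
cumsum(c(1, diff(codes) != 0))
# [1] 1 1 2 3 4 4 4 5 6 6
```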
Upvotes: 14