eamcvey

Reputation: 693

Most efficient way to label groups based on sequential values in R

I repeatedly encounter this type of task in different contexts in my work. I've used various approaches to address it in the past (usually some awkward combo of lag, diff, etc.), but keep thinking there must be a better, more general, more efficient way. The goal is to label groups in a new variable based on sequential changes in another variable. For example:

var1a <- c("A","A","B","B","B","C","D","D","D","D","D")

should result in a new variable labeling the four groups:

var2a <- c(1, 1, 2, 2, 2, 3, 4, 4, 4, 4, 4)

Somewhat less trivially, this should be based on the grouping of the same values in sequence, not just unique values of var1. For example:

var1b <- c(1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0)

should result in a new variable labeling the four groups:

var2b <- c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4, 4)

And to clarify, when I say "efficient" I'm more interested in straightforward/readable/robust/general than in computationally efficient, though that also has some importance.

Upvotes: 1

Views: 162

Answers (3)

IRTFM

Reputation: 263411

I was going to echo Steve Kern's suggestion to coerce a factor to numeric for the first question, but for the second question I would use this:

cumsum(c(1, diff(var1b) != 0))
#  [1] 1 1 1 2 2 3 4 4 4 4 4 4
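Note that diff() needs numeric input, so the line above would not work directly on a character vector such as var1a. Here is a minimal sketch of the same idea that compares neighbouring elements instead. The name run_id is hypothetical, and the sketch assumes x contains no NA values, since a comparison with NA would propagate NA through cumsum():

```r
# Hypothetical helper: label runs of identical neighbouring values.
# Assumes x is an atomic vector with no NA values.
run_id <- function(x) {
  n <- length(x)
  if (n == 0L) return(integer(0))
  # TRUE wherever an element differs from the one before it;
  # the leading TRUE starts the first group.
  cumsum(c(TRUE, x[-1L] != x[-n]))
}

var1a <- c("A","A","B","B","B","C","D","D","D","D","D")
var1b <- c(1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0)
run_id(var1a)
#  [1] 1 1 2 2 2 3 4 4 4 4 4
run_id(var1b)
#  [1] 1 1 1 2 2 3 4 4 4 4 4 4
```
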

I would point out that the question is ambiguous about the desired answer to the first question for an input such as:

var1a <- c("A","A","B","B","B","C","D","D","D","D","D", "a", "A", "B", "B")

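The rle approach will give a different answer than the factor approach here: rle starts a new group at every change in value, whereas factor gives every occurrence of a value the same code, so the trailing "A" and "B" entries reuse the codes of the earlier ones.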
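To make the difference concrete, here is a small sketch on that extended vector (named x here only for illustration). For the value-based labels it uses match(x, unique(x)), which numbers values by first appearance. That is a stand-in for the factor approach, chosen because the sort position of "a" relative to "A" in factor levels can depend on the locale:

```r
x <- c("A","A","B","B","B","C","D","D","D","D","D","a","A","B","B")

# Run-based labels: a new group at every change in value.
by_run <- with(rle(x), rep(seq_along(lengths), lengths))
by_run
#  [1] 1 1 2 2 2 3 4 4 4 4 4 5 6 7 7

# Value-based labels, numbered by first appearance:
# the trailing "A" and "B" fall back into groups 1 and 2.
by_value <- match(x, unique(x))
by_value
#  [1] 1 1 2 2 2 3 4 4 4 4 4 5 1 2 2
```
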

Upvotes: 0

jlhoward

Reputation: 59385

You could use run length encoding (?rle):

var1a <- c("A","A","B","B","B","C","D","D","D","D","D")
z     <- rle(var1a)
var2a <- rep(seq_along(z$lengths), z$lengths)  # seq_along() is also safe for empty input
var2a
#  [1] 1 1 2 2 2 3 4 4 4 4 4

var1b <- c(1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0)
z <- rle(var1b)
var2b <- rep(seq_along(z$lengths), z$lengths)
var2b
#  [1] 1 1 1 2 2 3 4 4 4 4 4 4

Or, more generally,

# Number each run of identical values; seq_along() keeps this correct for zero-length x
get.groups <- function(x) with(rle(x), rep(seq_along(lengths), lengths))
get.groups(var1a)
#  [1] 1 1 2 2 2 3 4 4 4 4 4
get.groups(var1b)
#  [1] 1 1 1 2 2 3 4 4 4 4 4 4

Upvotes: 3

Steve Kern

Reputation: 596

To answer the first question, I would try the following:

var2a <- as.integer(factor(var1a))

For the second question, I would use @jlhoward's suggestion of using rle.
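One caveat with the factor approach: factor() sorts its levels by default, so the integer codes follow sorted order, not order of appearance. Here is a small sketch with a made-up vector (v is not from the question) showing the difference, and one way to get appearance order by passing levels explicitly:

```r
v <- c("B", "B", "A", "C")

# Default: levels are sorted ("A", "B", "C"), so "B" gets code 2.
sorted_codes <- as.integer(factor(v))
sorted_codes
#  [1] 2 2 1 3

# Levels in order of first appearance, so "B" gets code 1.
appearance_codes <- as.integer(factor(v, levels = unique(v)))
appearance_codes
#  [1] 1 1 2 3
```

For the example in the question the two agree, because var1a already appears in sorted order.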

Upvotes: 0
