W. Walter

Reputation: 349

R - identify sequences in a vector

Suppose I have a vector ab containing A's and B's. I want to identify runs of identical values and create a vector v of length(ab) that holds the run length at the first and last position of each run and NA everywhere else.

However, there is a restriction: a second vector x of 0/1 values marks positions where a sequence must end, even if the value in ab does not change.

So for example:

ab <- rep("A", 6)

"A" "A" "A" "A" "A" "A"

x <- c(0,0,1,0,0,0)

0 0 1 0 0 0

should give

v <- c(3, NA, 3, 3, NA, 3)

A second example, which also includes a change in ab:

ab <- c(rep("A", 5), "B", rep("A", 3))
"A" "A" "A" "A" "A" "B" "A" "A" "A"
x <- c(rep(0,3),1,0,1,rep(0,3))
0 0 0 1 0 1 0 0 0

Here the output should be:

4 NA NA 4 1 1 3 NA 3

Without the restriction it would be:
5 NA NA NA 5 1 3 NA 3

So far, my code without the restriction looks like this:

ab <- c(rep("A", 5), "B", rep("A", 3))
x <- c(rep(0,3),1,0,1,rep(0,3))

cng <- ab[-1L] != ab[-length(ab)] # is there a change in A and B w.r.t the previous value?
idx <- which(cng) # where do the changes take place?
idx <- c(idx,length(ab)) # include the last value
seq_length <- diff(c(0, idx)) # how long are the sequences?

# create v
v <- rep(NA, length(ab))
v[idx] <- seq_length # sequence end
v[idx-(seq_length-1)] <- seq_length # sequence start
v

Does anyone have an idea how I can implement the restriction? Since my vector has about 2 million observations, I also wonder whether there is a more efficient way than my approach. I would appreciate any comments. Many thanks in advance!

Upvotes: 0

Views: 325

Answers (1)

AnilGoyal

Reputation: 26238

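You could do something like this: first compute the run length for every element within each group delimited by x, then set the interior positions of each run to NA.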


x <- c(rep(0,3),1,rep(0,2),1,rep(0,3))

ab <- c(rep("A", 5), "B", rep("A", 4))

# run length of every element, computed separately within each group delimited by x
res <- as.numeric(ave(ab, rev(cumsum(rev(x))),
                      FUN = function(z) with(rle(z), rep(lengths, lengths))))

> res
 [1] 4 4 4 4 1 1 1 3 3 3
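
The grouping argument passed to ave() is what encodes the restriction. A small sketch of what it evaluates to for this x (the names grp and grp_fwd are only illustrative and not part of the code above): every element up to and including a 1 in x gets the same group id, so rle() never counts across a forced end.

```r
x <- c(rep(0, 3), 1, rep(0, 2), 1, rep(0, 3))

# Reverse cumulative sum: the id drops by one right after each 1 in x
grp <- rev(cumsum(rev(x)))
grp
# [1] 2 2 2 2 1 1 1 0 0 0

# A forward formulation that yields the same grouping with increasing ids
grp_fwd <- cumsum(c(0, head(x, -1)))
grp_fwd
# [1] 0 0 0 0 1 1 1 2 2 2
```

Either vector splits the data into positions 1-4, 5-7 and 8-10; only the labels differ.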

# keep the first and last position of each run (and all runs of length 1), set the rest to NA
replace(res, with(rle(res), setdiff(seq_along(res), c(length(res) + 1 - cumsum(rev(lengths)),
                                                      cumsum(lengths),
                                                      which(res == 1)))), NA)
 [1]  4 NA NA  4  1  1  1  3 NA  3
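
One caveat with the NA step: rle(res) merges adjacent runs that happen to have the same length. For the question's first example (six A's with x = c(0,0,1,0,0,0)) res is all 3s, so only positions 1 and 6 would be kept. A minimal alternative sketch, assuming ab contains no NA values and x has the same length as ab, extends the diff-based code from the question by treating x == 1 as an additional break. The helper name mark_runs is hypothetical. It is fully vectorised, so it should scale to a few million elements, though that is not benchmarked here.

```r
# Hypothetical helper (not from the original post): write the run length at the
# first and last position of each run, treating x == 1 as a forced run end.
# Assumes ab has no NA values and length(x) == length(ab).
mark_runs <- function(ab, x) {
  n <- length(ab)
  stopifnot(length(x) == n)
  if (n == 0L) return(integer(0))
  # A run ends where the next value differs, where x flags an end,
  # or at the last element.
  is_end <- c(ab[-1L] != ab[-n] | x[-n] == 1, TRUE)
  ends   <- which(is_end)
  lens   <- diff(c(0L, ends))   # run lengths
  starts <- ends - lens + 1L    # first position of each run
  v <- rep(NA_integer_, n)
  v[ends]   <- lens
  v[starts] <- lens
  v
}

mark_runs(rep("A", 6), c(0, 0, 1, 0, 0, 0))
# [1]  3 NA  3  3 NA  3

mark_runs(c(rep("A", 5), "B", rep("A", 3)), c(rep(0, 3), 1, 0, 1, rep(0, 3)))
# [1]  4 NA NA  4  1  1  3 NA  3
```

Because run starts and ends are derived from the break positions directly rather than from the lengths, neighbouring runs of equal length stay separate.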

For the edited scenario (with ab[3] changed to "B"), the run lengths become:

x <- c(rep(0,3),1,rep(0,2),1,rep(0,3)) 
ab <- c(rep("A", 5), "B", rep("A", 4))
ab[3] <- 'B'

as.numeric(ave(ab, rev(cumsum(rev(x))), FUN = function(z){with(rle(z), rep(lengths, lengths))}))

 [1] 2 2 1 1 1 1 1 3 3 3

ab
 [1] "A" "A" "B" "A" "A" "B" "A" "A" "A" "A"

Upvotes: 1
