Reputation: 515
Say I have the following data.table:
library(data.table)
DT <- data.table(R=sample(0:1, 10000, rep=TRUE), Seq=0)
Which returns something like:
R Seq
1: 1 0
2: 1 0
3: 0 0
4: 0 0
5: 1 0
---
9996: 1 0
9997: 0 0
9998: 0 0
9999: 0 0
10000: 1 0
I want to generate a sequence (1, 2, 3,..., n) that resets whenever R changes from the previous row. Think of it like I'm counting a streak of random numbers.
So the above would then look like:
R Seq
1: 1 1
2: 1 2
3: 0 1
4: 0 2
5: 1 1
---
9996: 1 5
9997: 0 1
9998: 0 2
9999: 0 3
10000: 1 2
Thoughts?
Upvotes: 8
Views: 9948
Reputation: 375
Old question, but just in case someone needs a faster and easier way:
DT[, Seq := rowid(rleid(R))]
Explanation:
rleid
creates an index incrementing every time a new group of consecutive values is encountered. So rleid(c('a','a','b','b','a','a'))
returns 1 1 2 2 3 3
rowid
creates for each value an index that gets incremented every time this value is repeated (but not necessarily consecutive). So rowid(c('a','a','b','b','a','a'))
returns 1 2 1 2 3 4
It only takes a fraction of a second on this example with 10 million rows.
Upvotes: 2
Reputation: 52637
Here is an option:
set.seed(1)
DT <- data.table(R=sample(0:1, 10000, rep=TRUE), Seq=0L)
DT[, Seq:=seq(.N), by=list(cumsum(c(0, abs(diff(R)))))]
DT
We create a counter that increments every time your 0-1 variable changes using cumsum(abs(diff(R)))
. The c(0,
part is to ensure we get the correct length vector. Then we split by it with by
. This produces:
R Seq
1: 0 1
2: 0 2
3: 1 1
4: 1 2
5: 0 1
---
9996: 1 1
9997: 0 1
9998: 1 1
9999: 1 2
10000: 1 3
EDIT: Addressing request for clarification:
lets look at the computation I'm using in by
, broken down into two new columns:
DT[, diff:=c(0, diff(R))]
DT[, cumsum:=cumsum(abs(diff))]
print(DT, topn=10)
Produces:
R Seq diff cumsum
1: 0 1 0 0
2: 0 2 0 0
3: 1 1 1 1
4: 1 2 0 1
5: 0 1 -1 2
6: 1 1 1 3
7: 1 2 0 3
8: 1 3 0 3
9: 1 4 0 3
10: 0 1 -1 4
---
9991: 1 2 0 5021
9992: 1 3 0 5021
9993: 1 4 0 5021
9994: 1 5 0 5021
9995: 0 1 -1 5022
9996: 1 1 1 5023
9997: 0 1 -1 5024
9998: 1 1 1 5025
9999: 1 2 0 5025
10000: 1 3 0 5025
You can see how the cumulative sum of the absolute of the diff increments by one each time R changes. We can then use that cumsum
column to break up the data.table
into chunks, and for each chunk, generate a sequence using seq(.N)
that counts to the number of items in the chunk (.N
represents exactly that, how many items in each by
group).
Upvotes: 9