xiaodai
xiaodai

Reputation: 16004

How to count the number of times an element appears consecutively in a data.table?

I have a data.table that looks like this

ID, Order, Segment
1, 1, A
1, 2, B
1, 3, B
1, 4, C
1, 5, B
1, 6, B
1, 7, B
1, 8, B

Basically by ordering the data using the Order column. I would like to understand the number of consecutive B's for each of the ID's. Ideally the output I would like is

ID, Consec
1, 2
1, 4

Because the segment B appears consecutively in row 2 and 3 (2 times), and then again in row 5,6,7,8 (4 times).

The loop solution is quite obvious but would also be very slow.

Are there elegant solutions in data.table that is also fast?

P.S. The data I am dealing with has ~20 million rows.

Upvotes: 5

Views: 357

Answers (1)

akrun
akrun

Reputation: 886938

Try

 library(data.table)#v1.9.5+
  DT[order(ID, Order)][, indx:=rleid(Segment)][Segment=='B',
    list(Consec=.N), by = list(indx, ID)][,indx:=NULL][]

 #    ID Consec
 #1:  1      2
 #2:  1      4

Or as @eddi suggested

 DT[order(ID, Order)][, .(Consec = .N), by = .(ID, Segment, 
              rleid(Segment))][Segment == 'B', .(ID, Consec)]
 #    ID Consec
 #1:  1      2
 #2:  1      4

A more memory efficient method would be to use setorder instead of order (as suggested by @Arun)

  setorder(DT, ID, Order)[, .(Consec = .N), by = .(ID, Segment, 
                rleid(Segment))][Segment == 'B', .(ID, Consec)]
  #   ID Consec
  #1:  1      2
  #2:  1      4

Upvotes: 10

Related Questions