Saurabh
Saurabh

Reputation: 1626

Flagging the first row by group in data.table

In the data.table below, I want to flag the first row by each group.

  temp_dt <- data.table(date = as.Date(c("2000-01-01","2000-03-31","2000-07-01","2000-09-30", 
                                     "2001-01-01","2001-03-31","2001-07-01","2001-09-30",
                                     "2000-01-01","2000-03-31","2000-07-01","2000-09-30", 
                                     "2001-01-01","2001-03-31","2001-07-01","2001-09-30",
                                     "2000-01-01","2000-03-31","2000-07-01","2000-09-30", 
                                     "2001-01-01","2001-03-31","2001-07-01","2001-09-30")),
                    group = c(1,1,1,1,1,1,1,1,
                              2,2,6,6,6,8,8,8,
                              3,3,3,3,4,4,4,4))

Following is the expected outcome after adding the flag.

> temp_dt
          date group flag
 1: 2000-01-01     1    1
 2: 2000-03-31     1    0
 3: 2000-07-01     1    0
 4: 2000-09-30     1    0
 5: 2001-01-01     1    0
 6: 2001-03-31     1    0
 7: 2001-07-01     1    0
 8: 2001-09-30     1    0
 9: 2000-01-01     2    1
10: 2000-03-31     2    0
11: 2000-07-01     6    1
12: 2000-09-30     6    0
13: 2001-01-01     6    0
14: 2001-03-31     8    1
15: 2001-07-01     8    0
16: 2001-09-30     8    0
17: 2000-01-01     3    1
18: 2000-03-31     3    0
19: 2000-07-01     3    0
20: 2000-09-30     3    0
21: 2001-01-01     4    1
22: 2001-03-31     4    0
23: 2001-07-01     4    0
24: 2001-09-30     4    0
          date group flag

Here is the solution I tried (it is fast) but it is not working as expected.

temp_dt[, flag := if(identical(.I, 1)) 1 else 0, by = .(group)]

Since I am dealing with millions of rows, performance is key in the solution. I am looking for only data.table solution.

Other solutions provided on SO are too slow for my requirement.

Upvotes: 2

Views: 811

Answers (3)

Billy34
Billy34

Reputation: 2214

You can use rowid function. It will give incremental index starting at 1 for each group defined by given grouping variable(s). Detecting first line is simply done by comparing to 1.

temp_dt[, flag := rowid(group)==1]

As a bonus to detect last row of group (.N stands for number of rows of the current group)

temp_dt[, flag := rowid(group)==.N]

Upvotes: 6

B. Christian Kamgang
B. Christian Kamgang

Reputation: 6529

You can solve your problem as follow:

temp_dt[, flag := +!duplicated(group)]
# or
temp_dt[, flag := match(seq_len(.N), 1, 0), by=group]


          date group  flag
 1: 2000-01-01     1     1
 2: 2000-03-31     1     0
 3: 2000-07-01     1     0
 4: 2000-09-30     1     0
 5: 2001-01-01     1     0
 6: 2001-03-31     1     0
 7: 2001-07-01     1     0
 8: 2001-09-30     1     0
 9: 2000-01-01     2     1
10: 2000-03-31     2     0
11: 2000-07-01     6     1
12: 2000-09-30     6     0
13: 2001-01-01     6     0
14: 2001-03-31     8     1
15: 2001-07-01     8     0
16: 2001-09-30     8     0
17: 2000-01-01     3     1
18: 2000-03-31     3     0
19: 2000-07-01     3     0
20: 2000-09-30     3     0
21: 2001-01-01     4     1
22: 2001-03-31     4     0
23: 2001-07-01     4     0
24: 2001-09-30     4     0

Upvotes: 1

diomedesdata
diomedesdata

Reputation: 1075

This way finds the earliest row by date per group and then sets flag == 1 for those rows.

temp_dt[temp_dt[, .I[date == min(date)], by = .(group)]$V1, flag := 1]

# set the rest of the column to 0
temp_dt[is.na(flag), flag := 0]

Upvotes: 1

Related Questions