User800701
User800701

Reputation: 159

Find sum of one column based on consecutive values of another column in data.table

I have a data.table like the following:

    dput(DT)
structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), Job = structure(c(6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L), .Label = c("f1", "f2", "f3", "f4", "f5", "h1", "h2", "h3"), class = "factor"), Duration = c(2L, 3L, 4L, 4L, 3L, 2L, 1L, 0L, 2L, 3L, 4L, 5L, 4L, 0L), Outsourced = structure(c(1L,2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L), .Label = c("N","Y"), class = "factor")), .Names = c("ID", "Job", "Duration", "Outsourced"), row.names = c(NA, -14L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x103003178>)

which gives

         ID      Job     Duration Outsourced
 1:       1       h1        2          N
 2:       1       h2        3          Y
 3:       1       h3        4          Y
 4:       1       f1        4          Y
 5:       1       f2        3          N
 6:       1       f3        2          N
 7:       1       f4        1          N
 8:       1       f5        0          N
 9:       2       h1        2          N
10:       2       h2        3          Y
11:       2       f1        4          Y
12:       2       f2        5          N
13:       2       f3        4          N
14:       2       f4        0          N

I want to have the sum of Duration, for all jobs that have consecutive "Y" in the Outsourced column. Moreover, if activities belong to different ID, they shouldn't be counted as consecutive. One ID may have more than one set of consecutive "Y" jobs.

So for this example, the correct answer would be something like

        ID V1
1:       1 11
2:       2  7

Currently, I use rle to find running lengths of "Y" in the outsourced column, and then I try with ifs to do the rest, but I think this can be done more elegantly... Thank you

Upvotes: 0

Views: 127

Answers (1)

User800701
User800701

Reputation: 159

Following @docendo discimus suggestion from above, I managed to get what I wanted by adding a "unique" statement:

DT[, NewCol := sum(Duration), by = list(ID, rleid(Outsourced))][Outsourced == "N", NewCol := NA]
DT[!is.na(NewCol), unique(NewCol), ID]

EDIT: To cover cases that include many sets of Outsourced activities with the same duration, the second statement should be changed to:

DT[!is.na(NewCol), sum(rle(NewCol)$values), ID]

Upvotes: 1

Related Questions