Reputation: 159
I have a data.table like the following:
dput(DT)
structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), Job = structure(c(6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L), .Label = c("f1", "f2", "f3", "f4", "f5", "h1", "h2", "h3"), class = "factor"), Duration = c(2L, 3L, 4L, 4L, 3L, 2L, 1L, 0L, 2L, 3L, 4L, 5L, 4L, 0L), Outsourced = structure(c(1L,2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L), .Label = c("N","Y"), class = "factor")), .Names = c("ID", "Job", "Duration", "Outsourced"), row.names = c(NA, -14L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x103003178>)
which gives
ID Job Duration Outsourced
1: 1 h1 2 N
2: 1 h2 3 Y
3: 1 h3 4 Y
4: 1 f1 4 Y
5: 1 f2 3 N
6: 1 f3 2 N
7: 1 f4 1 N
8: 1 f5 0 N
9: 2 h1 2 N
10: 2 h2 3 Y
11: 2 f1 4 Y
12: 2 f2 5 N
13: 2 f3 4 N
14: 2 f4 0 N
I want to have the sum of Duration
, for all jobs that have consecutive "Y" in the Outsourced column. Moreover, if activities belong to different ID
, they shouldn't be counted as consecutive. One ID
may have more than one set of consecutive "Y" jobs.
So for this example, the correct answer would be something like
ID V1
1: 1 11
2: 2 7
Currently, I use rle
to find running lengths of "Y" in the outsourced column, and then I try with ifs to do the rest, but I think this can be done more elegantly...
Thank you
Upvotes: 0
Views: 127
Reputation: 159
Following @docendo discimus suggestion from above, I managed to get what I wanted by adding a "unique" statement:
DT[, NewCol := sum(Duration), by = list(ID, rleid(Outsourced))][Outsourced == "N", NewCol := NA]
DT[!is.na(NewCol), unique(NewCol), ID]
EDIT: To cover cases that include many sets of Outsourced activities with the same duration, the second statement should be changed to:
DT[!is.na(NewCol), sum(rle(NewCol)$values), ID]
Upvotes: 1