R data.table conditional selection

Question

I have a data.table

set.seed(1)
DT <- data.table(tag = rep(LETTERS[1:4],each = 2, times = 3),
                 year = rep(1:3, each = 8), month = rep(c(1:2), 12),
                 value = runif(24, 1, 10))
DT
    tag year month    value
 1:   A    1     1 3.389578
 2:   A    1     2 4.349115
 3:   B    1     1 6.155680
 4:   B    1     2 9.173870
 5:   C    1     1 2.815137
 6:   C    1     2 9.085507
 7:   D    1     1 9.502077
 8:   D    1     2 6.947180
 9:   A    2     1 6.662026
10:   A    2     2 1.556076
11:   B    2     1 2.853771
12:   B    2     2 2.589011
13:   C    2     1 7.183206
14:   C    2     2 4.456933
15:   D    2     1 7.928573
16:   D    2     2 5.479293
17:   A    3     1 7.458567
18:   A    3     2 9.927155
19:   B    3     1 4.420317
20:   B    3     2 7.997007
21:   C    3     1 9.412347
22:   C    3     2 2.909283
23:   D    3     1 6.865064
24:   D    3     2 2.129996

Sort this DT by year, month, and -value:

setorder(DT, year, month, -value)

will produce:

    tag year month    value
 1:   D    1     1 9.502077
 2:   B    1     1 6.155680
 3:   A    1     1 3.389578
 4:   C    1     1 2.815137
 5:   B    1     2 9.173870
 6:   C    1     2 9.085507
 7:   D    1     2 6.947180
 8:   A    1     2 4.349115
 9:   D    2     1 7.928573
10:   C    2     1 7.183206
11:   A    2     1 6.662026
12:   B    2     1 2.853771
13:   D    2     2 5.479293
14:   C    2     2 4.456933
15:   B    2     2 2.589011
16:   A    2     2 1.556076
17:   C    3     1 9.412347
18:   A    3     1 7.458567
19:   D    3     1 6.865064
20:   B    3     1 4.420317
21:   A    3     2 9.927155
22:   B    3     2 7.997007
23:   C    3     2 2.909283
24:   D    3     2 2.129996

I would like the result to be like following:

    tag year month    value
 1:   D    1     1 9.502077
 2:   B    1     1 6.155680
 3:   B    1     2 9.173870
 4:   D    1     2 6.947180
 5:   D    2     1 7.928573
 6:   C    2     1 7.183206
 7:   D    2     2 5.479293
 8:   C    2     2 4.456933
 9:   C    3     1 9.412347
10:   A    3     1 7.458567
11:   A    3     2 9.927155
12:   C    3     2 2.909283

The result DT would have the following property: within each year keep only the two tags that have larger value in month 1. For example, year 1, the two tags that have larger value are D and B, so keep D and B for the whole year 1. year 2 keeps D and C. At each month 1, I need to reselect the rows with the year.

akrun · Accepted Answer

We group by 'year', get the first two elements in 'tag' for 'month' 1 (as it is already ordered), create a logical index using %in% and subset the rows.

DT[, .SD[tag %in% head(tag[month ==1],2)], .(year)]
#    year tag month    value
# 1:    1   D     1 9.502077
# 2:    1   B     1 6.155680
# 3:    1   B     2 9.173870
# 4:    1   D     2 6.947180
# 5:    2   D     1 7.928573
# 6:    2   C     1 7.183206
# 7:    2   D     2 5.479293
# 8:    2   C     2 4.456933
# 9:    3   C     1 9.412347
#10:    3   A     1 7.458567
#11:    3   A     2 9.927155
#12:    3   C     2 2.909283

R data.table conditional selection

Answers (1)

Related Questions