R data.table: keep column when grouping by expression

Question

When grouping by an expression involving a column (e.g. DT[...,.SD[c(1,.N)],by=expression(col)]), I want to keep the value of col in .SD.

For example, below I group by the remainder of a divided by 3 and keep the first and last observation in each group. However, a is no longer present in the result:

library(data.table)

f <- function(x) x %% 3

Q <- data.table(a = 1:20, x = rnorm(20), y = rnorm(20))
Q[, .SD[c(1, .N)], by = f(a)]

   f         x          y
1: 1 0.2597929  1.0256259
2: 1 2.1106619 -1.4375193
3: 2 1.2862501  0.7918292
4: 2 0.6600591 -0.5827745
5: 0 1.3758503  1.3122561
6: 0 2.6501140  1.9394756

The desired output is as if I had done the following:

Q[, f := f(a)]
tmp <- Q[, .SD[c(1, .N)], by=f]
Q[, f := NULL]
tmp[, f := NULL]
tmp

    a         x          y
1:  1 0.2597929  1.0256259
2: 19 2.1106619 -1.4375193
3:  2 1.2862501  0.7918292
4: 20 0.6600591 -0.5827745
5:  3 1.3758503  1.3122561
6: 18 2.6501140  1.9394756

Is there a way to do this directly, without adding a temporary column or creating an intermediate data.table?

akrun · Accepted Answer

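Instead of .SD, use .I to get the row indices of the first and last row in each group, extract that column ($V1), and use it to subset the original dataset. Because the subset is taken from Q itself, column a is kept.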

library(data.table)
Q[Q[, .I[c(1, .N)], by = f(a)]$V1]
#    a          x          y
#1:  1  0.7265238  0.5631753
#2: 19  1.7110611 -0.3141118
#3:  2  0.1643566 -0.4704501
#4: 20  0.5182394 -0.1309016
#5:  3 -0.6039137  0.1349981
#6: 18  0.3094155 -1.1892190

NOTE: The values in columns 'x' and 'y' differ from those in the question because no seed was set with set.seed().
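
To make the two steps easier to follow, here is a minimal sketch that splits the one-liner apart. The seed value (42) and the names idx and res are assumptions added for illustration; they are not part of the original question or answer.

```r
library(data.table)

set.seed(42)  # assumed seed, only so that x and y are reproducible
f <- function(x) x %% 3
Q <- data.table(a = 1:20, x = rnorm(20), y = rnorm(20))

# Step 1: within each group, .I holds that group's row numbers in Q.
# Keep the first and the last row number; the unnamed result column is V1.
idx <- Q[, .I[c(1, .N)], by = f(a)]
idx$V1
# 1 19 2 20 3 18

# Step 2: subset Q by those row numbers, so every column (including a) is kept.
res <- Q[idx$V1]
res$a
# 1 19 2 20 3 18
```

If a group might contain a single row, c(1, .N) selects that row twice; wrapping it as unique(c(1, .N)) is one way to avoid the duplicate.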
