When grouping by all columns in a data.table, .SD is empty

Question

I'm having trouble getting consistent output in data.table using consistent syntax. See the example below:

library(data.table)
d <- data.table(x = c(1,1,2,2), y = c(1,1,2,2))
# data.table shown below
#    x y
# 1: 1 1
# 2: 1 1
# 3: 2 2
# 4: 2 2

d[, if(.N>1) .SD else NULL, by = .(x, y)]
# returns Empty data.table (0 rows) of 2 cols: x,y

When all columns are used for grouping in by, .SD is empty, causing an empty data.table to be returned.

When .SD contains at least one column that is not being grouped by, either because fewer columns are used in by or because another column is added, the correct output is returned.

d[, if(.N>1) .SD else NULL, by = x]
# returns
#    x y
# 1: 1 1
# 2: 1 1
# 3: 2 2
# 4: 2 2

d <- data.table(x = c(1,1,2,2), y = c(1,1,2,2), t = 1:4)
d[, if(.N>1) .SD else NULL, by = .(x, y)]
# returns
#    x y t
# 1: 1 1 1
# 2: 1 1 2
# 3: 2 2 3
# 4: 2 2 4

I'm trying to write code that returns the rows that appear more than once, and that works both when the by columns consist of all columns in the data.table and when they don't. Toward this end, I tried setting .SDcols = c("x", "y"). However, the columns get repeated in the output:

d[, if(.N>1) .SD else NULL, by = .(x, y), .SDcols = c("x", "y")]
#    x y x y
# 1: 1 1 1 1
# 2: 1 1 1 1
# 3: 2 2 2 2
# 4: 2 2 2 2

Is there a way to make it so d[, if(.N > 1) .SD else NULL, by = colnames] returns the desired output independent of whether the column names grouped by consist of all columns in 'd'? Or do I need to use an if statement and break up the 2 cases?

Frank · Accepted Answer

Here's one approach:

setkey(d,x,y)
dnew <- d[d[,.N>1,by=key(d)][(V1),key(d),with=FALSE]]

This

1. sets (x,y) as the key;
2. identifies which (x,y) groups satisfy the criterion; and then
3. selects those groups from d.
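
The three steps can also be wrapped in a small function so the same call works whether or not the grouping columns cover every column of d. The following is only a sketch under some assumptions: keep_repeated is a hypothetical helper name (not part of data.table), and it joins with the on= argument instead of setkey, which assumes data.table 1.9.6 or later and avoids re-keying d by reference. It also assumes d has no column literally named cols or keep.

```r
library(data.table)

# Hypothetical helper, not part of data.table: keep the rows of `dt` whose
# combination of values in `cols` occurs more than once.
keep_repeated <- function(dt, cols) {
  # Step 1: one row per group, with a logical flag for groups of size > 1.
  flags <- dt[, .(keep = .N > 1), by = cols]
  # Step 2: keep only the grouping columns of the flagged groups.
  wanted <- flags[(keep), cols, with = FALSE]
  # Step 3: join back to dt to select the rows belonging to those groups.
  # Rows come back in group order, which may differ from dt's row order.
  dt[wanted, on = cols]
}

# Case 1: the grouping columns are all the columns of d.
d <- data.table(x = c(1, 1, 2, 2), y = c(1, 1, 2, 2))
print(keep_repeated(d, c("x", "y")))

# Case 2: d has an extra column that is not grouped by.
d <- data.table(x = c(1, 1, 2, 2), y = c(1, 1, 2, 2), t = 1:4)
print(keep_repeated(d, c("x", "y")))
```

Because the group test happens in a separate step that only uses .N, the result does not depend on whether .SD has any columns left after grouping.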
