k13

Reputation: 723

When grouping by all columns in a data.table, .SD is empty

I'm having trouble getting consistent output from data.table when using the same syntax. See the example below:

library(data.table)
d <- data.table(x = c(1,1,2,2), y = c(1,1,2,2))
# data.table shown below
#    x y
# 1: 1 1
# 2: 1 1
# 3: 2 2
# 4: 2 2

d[, if(.N>1) .SD else NULL, by = .(x, y)]
# returns Empty data.table (0 rows) of 2 cols: x,y

When all columns are used for grouping in by, .SD is empty, causing an empty data.table to be returned.
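A minimal sketch of why this happens: by default, .SD holds every column of the data.table except the grouping columns, so the columns left over for .SD can be listed with setdiff(). The names by_cols and sd_cols are illustrative only and are not part of data.table.

```r
library(data.table)

d <- data.table(x = c(1, 1, 2, 2), y = c(1, 1, 2, 2))

# Illustrative variable (an assumption for this sketch): the grouping columns
by_cols <- c("x", "y")

# Columns that remain available to .SD once the grouping columns are excluded
sd_cols <- setdiff(names(d), by_cols)

# No columns are left when every column is used for grouping
length(sd_cols)

# Grouping by x alone leaves y for .SD
setdiff(names(d), "x")
```

With nothing left for .SD to hold, returning .SD from each group contributes no rows, which matches the empty result above.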

When .SD contains at least one column that is not being grouped by, either because fewer columns are used in by or because another column is added to the table, the correct output is returned.

d[, if(.N>1) .SD else NULL, by = x]
# returns
#    x y
# 1: 1 1
# 2: 1 1
# 3: 2 2
# 4: 2 2

d <- data.table(x = c(1,1,2,2), y = c(1,1,2,2), t = 1:4)
d[, if(.N>1) .SD else NULL, by = .(x, y)]
# returns
#    x y t
# 1: 1 1 1
# 2: 1 1 2
# 3: 2 2 3
# 4: 2 2 4

I'm trying to write code that returns the rows appearing more than once, and that works whether or not the by columns consist of all columns in the data.table. Toward this end, I tried setting .SDcols = c("x", "y"). However, the columns then get repeated in the output:

d[, if(.N>1) .SD else NULL, by = .(x, y), .SDcols = c("x", "y")]
# returns
#    x y x y
# 1: 1 1 1 1
# 2: 1 1 1 1
# 3: 2 2 2 2
# 4: 2 2 2 2

Is there a way to make it so d[, if(.N > 1) .SD else NULL, by = colnames] returns the desired output independent of whether the column names grouped by consist of all columns in 'd'? Or do I need to use an if statement and break up the 2 cases?

Upvotes: 3

Views: 400

Answers (1)

Frank

Reputation: 66819

Here's one approach:

setkey(d, x, y)
dnew <- d[d[, .N > 1, by = key(d)][(V1), key(d), with = FALSE]]

This

  1. sets (x,y) to a key;
  2. identifies which (x,y) groups satisfy the criterion; and then
  3. selects those groups from d.
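The three steps above can be unpacked roughly as follows. This is only a sketch of the same one-liner split into intermediate objects. The names flags and keep are made up for illustration, and the extra (3, 3) row is added here so that one group fails the criterion.

```r
library(data.table)

# Example data: (1,1) and (2,2) appear twice, (3,3) appears once
d <- data.table(x = c(1, 1, 2, 2, 3), y = c(1, 1, 2, 2, 3))

# Step 1: set (x, y) as the key
setkey(d, x, y)

# Step 2: one row per (x, y) group, with a logical column V1
# that is TRUE when the group has more than one row
flags <- d[, .N > 1, by = key(d)]

# Keep only the key columns of the groups that satisfy the criterion
keep <- flags[(V1), key(d), with = FALSE]

# Step 3: join back to d on its key to select all rows of those groups
dnew <- d[keep]
dnew
```

Because the criterion is computed from .N rather than from .SD, this should not depend on whether any non-grouping columns exist in d.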

Upvotes: 4
