Reputation: 723
I'm having trouble getting consistent output from data.table using consistent syntax. See the example below:
library(data.table)
d <- data.table(x = c(1,1,2,2), y = c(1,1,2,2))
# data.table shown below
#    x y
# 1: 1 1
# 2: 1 1
# 3: 2 2
# 4: 2 2
d[, if(.N>1) .SD else NULL, by = .(x, y)]
# returns Empty data.table (0 rows) of 2 cols: x,y
When all columns are used for grouping in by, .SD is empty, causing an empty data.table to be returned.
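To make this concrete, here is a small sketch of which columns end up in .SD. It assumes the default behaviour, where .SD holds every column that is not named in by and .SDcols is not set. The names `by_all`, `by_some`, `sd_cols_all` and `sd_cols_some` are hypothetical and only used for illustration.

```r
library(data.table)

d <- data.table(x = c(1, 1, 2, 2), y = c(1, 1, 2, 2))

# Hypothetical helper vectors: the grouping columns in each scenario
by_all  <- c("x", "y")  # group by every column
by_some <- "x"          # group by only one column

# By default, .SD holds the columns that are NOT grouping columns
sd_cols_all  <- setdiff(names(d), by_all)   # nothing left for .SD
sd_cols_some <- setdiff(names(d), by_some)  # "y" is left for .SD

print(sd_cols_all)
print(sd_cols_some)
```

With nothing left over for .SD, there is nothing for j to return per group, which matches the empty result above.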
When .SD contains columns that are not being grouped by, either because fewer columns are grouped by or because another column is added, the correct output is returned:
d[, if(.N>1) .SD else NULL, by = x]
# returns
   x y
1: 1 1
2: 1 1
3: 2 2
4: 2 2
d <- data.table(x = c(1,1,2,2), y = c(1,1,2,2), t = 1:4)
d[, if(.N>1) .SD else NULL, by = .(x, y)]
# returns
   x y t
1: 1 1 1
2: 1 1 2
3: 2 2 3
4: 2 2 4
I'm trying to write code that returns the rows appearing more than once and that works whether or not the by columns consist of all columns in the data.table. Toward this end, I tried setting .SDcols = c("x", "y"). However, the columns get repeated in the output:
d[, if(.N>1) .SD else NULL, by = .(x, y), .SDcols = c("x", "y")]
   x y x y
1: 1 1 1 1
2: 1 1 1 1
3: 2 2 2 2
4: 2 2 2 2
Is there a way to make d[, if(.N > 1) .SD else NULL, by = colnames] return the desired output regardless of whether the columns grouped by consist of all columns in 'd'? Or do I need to use an if statement and handle the 2 cases separately?
Upvotes: 3
Views: 400
Reputation: 66819
Here's one approach:
setkey(d,x,y)
dnew <- d[d[,.N>1,by=key(d)][(V1),key(d),with=FALSE]]
This sets (x,y) as the key; finds which (x,y) groups satisfy the criterion; and then subsets d to those groups.
Upvotes: 4
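A possible alternative that avoids setting a key is to collect row numbers with .I instead of returning .SD. This is only a sketch, not part of the approach above: the function name `rows_in_repeated_groups` and its arguments are hypothetical, and it assumes `by_cols` is a character vector of column names that exist in the table.

```r
library(data.table)

# Hypothetical helper: keep rows whose by_cols combination occurs more than once
rows_in_repeated_groups <- function(dt, by_cols) {
  # .I holds the row numbers of each group; indexing with a single
  # TRUE keeps them all, indexing with FALSE keeps none
  idx <- dt[, .I[.N > 1L], by = by_cols]$V1
  dt[idx]
}

# Case 1: the grouping columns are all columns of the table
d_all <- data.table(x = c(1, 1, 2, 2), y = c(1, 1, 2, 2))
res_all <- rows_in_repeated_groups(d_all, c("x", "y"))

# Case 2: the table has an extra, non-grouping column
d_extra <- data.table(x = c(1, 1, 2, 2), y = c(1, 1, 2, 2), t = 1:4)
res_extra <- rows_in_repeated_groups(d_extra, c("x", "y"))

# Case 3: one combination is unique and should be dropped
d_mixed <- data.table(x = c(1, 1, 2), y = c(1, 1, 3))
res_mixed <- rows_in_repeated_groups(d_mixed, c("x", "y"))
```

Because j never touches .SD here, the call does not depend on whether any non-grouping columns exist. Note that the rows come back ordered by group rather than in their original order.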