Reputation: 59970
When using .SD
to apply a function to a subset of dt
's columns I can't seem to find the correct way to handle the situation where I have duplicated column names... e.g.
# Make some data
set.seed(123)
dt <- data.table( matrix( sample(6,16,repl=T) , 4 ) )
setnames(dt , rep( letters[1:2] , 2 ) )
# a b a b
#1: 2 6 4 5
#2: 5 1 3 4
#3: 3 4 6 1
#4: 6 6 3 6
# Use .SDcols to multiply both column 'a' specifying them by numeric position
dt[ , lapply( .SD , `*` , 2 ) , .SDcols = which( names(dt) %in% "a" ) ]
# a a
#1: 4 4
#2: 10 10
#3: 6 6
#4: 12 12
I couldn't get it to work with when .SDcols
was a character vector of column names so I tried numeric positions (which( names(dt) %in% "a" )
gives a vector [1] 1 3
) but it also seems to just multiply the first a
column only. Am I doing something wrong?
.SDcols
Advanced. Specifies the columns of x included in .SD. May be character column names or numeric positions.
These also returned the same result as above...
dt[ , lapply( .SD ,function(x) x*2 ) , .SDcols = which( names(dt) %in% "a" ) ]
dt[ , lapply( .SD ,function(x) x*2 ) , .SDcols = c(1,3) ]
packageVersion("data.table")
#[1] ‘1.8.11’
Upvotes: 3
Views: 621
Reputation: 118799
Consistent subset rules on
data.tables
with duplicate columns. In short, if indices are directly provided, 'j', or in.SDcols
, then just those columns are either returned (or deleted if you provide -.SDcols
or!j
). If instead, column names are given and there are more than one occurrence of that column, then it's hard to decide which to keep and which to remove on a subset. Therefore, to remove, all occurrences of that column are removed, and to keep, always the first column is returned each time. Also closes #5688 and #5008. Note that usingby=
to aggregate on duplicate columns may not give intended result still, as it may not operate on the proper column.
Basically, if you do:
dt[, lapply(.SD, `*`, 2), .SDcols=c("a", "a")]
# a a
# 1: 4 4
# 2: 10 10
# 3: 6 6
# 4: 12 12
It'll still give the unintended result, as it's hard to tell which "a" you're mentioning each time - so choosing the first always.
But if you clearly specify (as you do in your Q):
dt[, lapply(.SD, `*`, 2), .SDcols=which( names(dt) %in% "a" )]
# a a
# 1: 4 8
# 2: 10 6
# 3: 6 12
# 4: 12 6
Upvotes: 1
Reputation: 17189
How about this
dt[, "a"] * 2
## a a.1
## 1 4 8
## 2 10 6
## 3 6 12
## 4 12 6
For more detailed discussion
https://chat.stackoverflow.com/transcript/message/12783493#12783493
Upvotes: 1