Scott Ritchie

Reputation: 10543

Data.table filter out specific values in each column (which vary by column)

I have a data.table of measurements. Each column has a lower detection limit (and possibly an upper detection limit):

set.seed(1)
dt <- data.table(id=1:5, A=rnorm(5), B=rnorm(5, mean=2), C=rnorm(5,mean=-1))
setkey(dt, id)
# "randomly" disperse upper and lower limits among the measurement columns
dt[3,A := -5]
dt[2,B := -3]
dt[5,B := 7]
dt[1,C := -10]
dt
   id          A         B           C
1:  1 -0.6264538  1.179532 -10.0000000
2:  2  0.1836433 -3.000000  -0.6101568
3:  3 -5.0000000  2.738325  -1.6212406
4:  4  1.5952808  2.575781  -3.2146999
5:  5  0.3295078  7.000000   0.1249309

I want to filter out (set to NA) the values in each column of dt that exactly match the lower and upper measurement limits listed in another data.table:

limits <- data.table(measurement=LETTERS[1:3], lower=c(-5,-3,-10), 
                     upper=c(NA, 7, NA))
setkey(limits, measurement)
limits
   measurement lower upper
1:           A    -5    NA
2:           B    -3     7
3:           C   -10    NA

My expected output is:

dt
   id          A        B          C
1:  1 -0.6264538 1.179532         NA
2:  2  0.1836433       NA -0.6101568
3:  3         NA 2.738325 -1.6212406
4:  4  1.5952808 2.575781 -3.2146999
5:  5  0.3295078       NA  0.1249309

I wasn't able to construct a nice solution to this, so at the moment I'm using a clunky pair of nested for loops to get the job done:

# loop over every cell; with=FALSE lets j be interpreted as a column number
for (i in 1:nrow(dt)) {
  for (j in 2:ncol(dt)) {
    if (is.na(dt[i, j, with=F])) {
      next
    } else if (dt[i, j, with=F] == limits[names(dt)[j]][, lower]) {
      dt[i, j := NA_real_, with=F]
    } else if (is.na(limits[names(dt)[j]][, upper])) {
      next
    } else if (dt[i, j, with=F] == limits[names(dt)[j]][, upper]) {
      dt[i, j := NA_real_, with=F] 
    } else {
      next
    }   
  }
}

But surely there is something nicer and faster? I had a play around with applying each column of the limits data.table to each row of dt, but didn't have any success.

Upvotes: 2

Views: 395

Answers (2)

BrodieG

Reputation: 52677

Here is an alternative:

dt[, 2:length(dt) := lapply(
  2:length(dt), 
  function(x) ifelse(.SD[[x]] %in% limits[x - 1, c(lower, upper)], NA, .SD[[x]])
) ]

Since the rows in limits are in the same order as the columns in dt, you can just cycle through the columns. The result:

   id          A        B          C
1:  1 -0.6264538 1.179532         NA
2:  2  0.1836433       NA -0.6101568
3:  3         NA 2.738325 -1.6212406
4:  4  1.5952808 2.575781 -3.2146999
5:  5  0.3295078       NA  0.1249309
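The heart of this answer is the `ifelse(x %in% c(lower, upper), NA, x)` pattern. A minimal base-R illustration with toy values (not taken from the original data):

```r
x <- c(-0.63, -5.00, 1.60)       # toy measurements; the lower limit is -5
y <- ifelse(x %in% c(-5, NA), NA, x)
y
# -0.63    NA  1.60
```

Note that the NA in the limits vector is harmless here: `%in%` only matches it against values that are already NA, which are then just replaced with NA again.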

Upvotes: 2

Arun

Reputation: 118849

First, I'd transpose your limits data.table as follows:

require(reshape2)
require(data.table)
limits = dcast.data.table(melt(limits, id=1), variable ~ measurement)

#    variable  A  B   C
# 1:    lower -5 -3 -10
# 2:    upper NA  7  NA

Then you can match each column of dt against the corresponding column of limits, and replace the matches with NA using set, as follows:

for (i in 2:ncol(dt)) {
    set(dt, i=which(dt[[i]] %in% limits[[i]]), j=i, value=NA_real_)
}

#    id          A        B          C
# 1:  1 -0.6264538 1.179532         NA
# 2:  2  0.1836433       NA -0.6101568
# 3:  3         NA 2.738325 -1.6212406
# 4:  4  1.5952808 2.575781 -3.2146999
# 5:  5  0.3295078       NA  0.1249309
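The same match-and-blank loop also works in plain base R, should you need it outside data.table. A sketch on hypothetical toy data (the column names and limit values are illustrative only):

```r
df   <- data.frame(A = c(-0.63, -5.0), B = c(-3.0, 2.7))  # toy measurements
lims <- list(A = c(-5, NA), B = c(-3, 7))                  # lower/upper limits per column
for (nm in names(df)) {
  hit <- df[[nm]] %in% lims[[nm]]   # TRUE where a value equals either limit
  df[[nm]][hit] <- NA               # blank those values out
}
# df$A is now c(-0.63, NA); df$B is now c(NA, 2.7)
```

The difference is that `set()` modifies the data.table by reference, so the data.table version avoids the copies that base-R column replacement makes, which matters on large tables.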

Upvotes: 6
