Reputation: 7784
I am trying, in my quest to rewrite old (slow) code with the data.table
package, to figure out the best way to use apply
with a data.table.
I have a data.table with multiple id columns, then multiple columns that have dose-response data in a wide format. I need to generalize the answer because not all data.tables will have the same number of dose-response columns. For simplicity I think the following data.table addresses the issue:
library(data.table)
library(microbenchmark)
set.seed(1234)
DT1 = data.table(unique_id = paste0('id',1:1e6),
dose1 = sample(c(1:9,NA),1e6,replace=TRUE),
dose2 = sample(c(1:9,NA),1e6,replace=TRUE)
)
> DT1
unique_id dose1 dose2
1: id1 2 2
2: id2 7 4
3: id3 7 9
4: id4 7 4
5: id5 9 3
---
999996: id999996 4 3
999997: id999997 NA 3
999998: id999998 4 2
999999: id999999 8 5
1000000: id1000000 6 7
So each row has a unique id, some other ids, and I have left out the response columns, because they will be NA
where the dose columns are NA
. What I need to do is remove rows where all of the dose columns are NA
. I came up with the first option, then realized I could trim it down to the second option.
DT2 <- copy(DT1)
DT3 <- copy(DT1)
len.not.na <- function(x){length(which(!is.na(x)))}
option1 <- function(DT){
DT[,flag := apply(.SD,1,len.not.na),.SDcols=grep("dose",colnames(DT))]
DT <- DT[flag != 0]
DT[ , flag := NULL ]
}
option2 <- function(DT){
DT[ apply(DT[,grep("dose",colnames(DT)),with=FALSE],1,len.not.na) != 0 ]
}
> microbenchmark(op1 <- option1(DT2), op2 <- option2(DT3),times=25L)
Unit: seconds
expr min lq median uq max neval
op1 <- option1(DT2) 8.364504 8.863436 9.145341 11.27827 11.50356 25
op2 <- option2(DT3) 8.291549 8.774746 8.982536 11.15269 11.72199 25
Clearly they two options do about the same thing, with option 1 having a few more steps, but I wanted to test how calling .SD
might slow things down as has been suggested by other posts (for example).
Either way both options are still on the slow side. Any suggestions to speeding things up?
EDIT with comment from @AnandaMahto
DT4 <- copy(DT1)
option3 <- function(DT){
DT[rowSums(DT[,grep("dose",colnames(DT)),with=FALSE]) != 0]
}
> microbenchmark(op2 <- option2(DT3), op3 <- option3(DT4),times=5L)
Unit: milliseconds
expr min lq median uq max neval
op2 <- option2(DT3) 7738.21094 7810.87777 7838.6067 7969.5543 8407.4069 5
op3 <- option3(DT4) 83.78921 92.65472 320.6273 559.8153 783.0742 5
rowSums
is definitely faster. I am happy with the solution unless anyone has something faster.
Upvotes: 4
Views: 2142
Reputation: 193527
My approach would be as follows:
Use rowSums
to find the rows you want to keep:
Dose <- grep("dose", colnames(DT1))
# .. menas "up one level
Flag <- rowSums(is.na(DT1[, ..Dose])) != length(Dose)
DT1[Flag]
Upvotes: 6
Reputation: 49448
DT1[!is.na(dose1) | !is.na(dose2)]
The Reduce
generalization in previous edits was wrong, here's the correct version:
DT1[(!Reduce("*", DT1[, lapply(.SD, is.na), .SDcols = patterns("dose")]))]
Benchmarks
rowsum = function(dt) {
Dose <- grep("dose", colnames(dt))
Flag <- rowSums(is.na(dt[, ..Dose])) != length(Dose)
dt[Flag]
}
reduce = function(dt) {
dt[(!Reduce("*", dt[, lapply(.SD, is.na), .SDcols = patterns("dose")]))]
}
# original data
microbenchmark(rowsum(copy(DT1)), reduce(copy(DT1)), times = 10)
#Unit: milliseconds
# expr min lq median uq max neval
# rowsum(copy(DT1)) 184.4121 190.9895 238.2935 248.0654 266.5708 10
# reduce(copy(DT1)) 141.2399 172.2020 199.1012 219.4567 424.1526 10
# a lot more columns
for (i in 10:100) DT1[, paste0('dose', i) := sample(c(NA, 1:10), 1e6, T)]
microbenchmark(rowsum(copy(DT1)), reduce(copy(DT1)), times = 10)
#Unit: seconds
# expr min lq median uq max neval
# rowsum(copy(DT1)) 4.160035 4.428527 4.505705 4.754398 4.906849 10
# reduce(copy(DT1)) 3.421675 4.172700 4.507304 4.622355 5.156840 10
So at 100 columns Reduce
still does all right.
Upvotes: 4
Reputation: 8691
Maybe easier to just select all the rows with no NAs into a new table like this. You can amend the 'which' condition depending on your table:
DT2<-(DT1[which(!is.na(DT1$dose1) & !is.na(DT1$dose2)),])
Upvotes: 0