Reputation: 892
I have a data.table on which I would like to perform some processing. As an initial step, I'd like to create a new data.table that has extra columns initialized to NA or 0.
I loop over the columns of interest and attempt to assign NA/0, which fails or has issues as explained below.
library(data.table)
input_allele <- data.table(
  FID = paste0("gid", 1:10), IID = paste0("IID", 11:20), PAT = 1:10,
  MAT = rep(0, 10), SEX = rep(1, 10), PHENOTYPE = rep(1, 10),
  SNP1 = c(rep(1, 5), rep(0, 5)), SNP2 = c(rep(1, 6), rep(0, 3), NA),
  SNP3 = c(rep(NA, 6), rep(1, 4)), SNP4 = c(rep(NA, 6), rep(0, 4)),
  SNP5 = c(rep(1, 6), rep(0, 4))
)
multiplied_value <- input_allele[, c(1:6)]
for (temp_snp in colnames(input_allele[, .SD, .SDcols = c(7:11)])) {
  temp_snpquote <- quote(temp_snp)
  multiplied_value[, (temp_snpquote) := 0]
}
I get an error:
Error in `[.data.table`(multiplied_value, , `:=`((temp_snpquote), 0)) : LHS of := must be a symbol, or an atomic vector (column names or positions).
If I use eval, I run into odd behavior: after the loop completes, I have to type multiplied_value twice before the data.table is printed on the console. This is startling to me.
for (temp_snp in colnames(input_allele[, .SD, .SDcols = c(7:11)])) {
  temp_snpquote <- quote(temp_snp)
  multiplied_value[, eval(temp_snpquote) := 0]
}
I would like to understand: 1) how to set a new column to NA or 0, and 2) why, when using eval, I have to type multiplied_value twice before the data.table is printed.
R version 4.0.0 (2020-04-24), data.table_1.13.4
Unix debian distribution
Upvotes: 1
Views: 1036
Reputation: 193687
Consolidating some of the comments into an answer here...
According to ?set, the overhead of calling [.data.table repeatedly (for example, inside a loop) can add up. In those cases, you can try set instead.
Also, a call to := or to any set* function should be followed by [] to print the output.
With that, here are the two alternatives:
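## copy() so that each name gets its own table (chained `<-` would bind all three to one object)
copy1 <- copy(input_allele[, 1:6]); copy2 <- copy(copy1); copy3 <- copy(copy1)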
new <- colnames(input_allele[,.SD,.SDcols=c(7:11)])
## Using `set` :
for (i in new) {
set(copy1, j = i, value = 0)[]
}
head(copy1)
## FID IID PAT MAT SEX PHENOTYPE SNP1 SNP2 SNP3 SNP4 SNP5
## 1: gid1 IID11 1 0 1 1 0 0 0 0 0
## 2: gid2 IID12 2 0 1 1 0 0 0 0 0
## 3: gid3 IID13 3 0 1 1 0 0 0 0 0
## 4: gid4 IID14 4 0 1 1 0 0 0 0 0
## 5: gid5 IID15 5 0 1 1 0 0 0 0 0
## 6: gid6 IID16 6 0 1 1 0 0 0 0 0
## Using `:=` :
for (i in new) {
copy2[, (i) := 0][]
}
head(copy2)
## FID IID PAT MAT SEX PHENOTYPE SNP1 SNP2 SNP3 SNP4 SNP5
## 1: gid1 IID11 1 0 1 1 0 0 0 0 0
## 2: gid2 IID12 2 0 1 1 0 0 0 0 0
## 3: gid3 IID13 3 0 1 1 0 0 0 0 0
## 4: gid4 IID14 4 0 1 1 0 0 0 0 0
## 5: gid5 IID15 5 0 1 1 0 0 0 0 0
## 6: gid6 IID16 6 0 1 1 0 0 0 0 0
You could also avoid the loop:
copy3[, (new) := as.list(rep(0, length(new)))][]
## FID IID PAT MAT SEX PHENOTYPE SNP1 SNP2 SNP3 SNP4 SNP5
## 1: gid1 IID11 1 0 1 1 0 0 0 0 0
## 2: gid2 IID12 2 0 1 1 0 0 0 0 0
## 3: gid3 IID13 3 0 1 1 0 0 0 0 0
## 4: gid4 IID14 4 0 1 1 0 0 0 0 0
## 5: gid5 IID15 5 0 1 1 0 0 0 0 0
## 6: gid6 IID16 6 0 1 1 0 0 0 0 0
## 7: gid7 IID17 7 0 1 1 0 0 0 0 0
## 8: gid8 IID18 8 0 1 1 0 0 0 0 0
## 9: gid9 IID19 9 0 1 1 0 0 0 0 0
## 10: gid10 IID20 10 0 1 1 0 0 0 0 0
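The question also asks about NA. A minimal sketch of the same patterns with NA follows, using a hypothetical toy table (`dt_na`, `dt_zero`) and column vector (`snp_cols`) rather than the question's data. It relies on a single right-hand-side value being recycled across every column named on the left, and on a typed NA (`NA_real_`) so the new columns are numeric rather than logical.

```r
library(data.table)

# Hypothetical column names, for illustration only
snp_cols <- c("SNP1", "SNP2")

# Loop-free form: one value on the RHS is recycled to all listed columns.
# NA_real_ gives numeric columns; a plain NA would give logical columns.
dt_na <- data.table(id = 1:3)
dt_na[, (snp_cols) := NA_real_]

# set() form: one column per call, each filled with 0
dt_zero <- data.table(id = 1:3)
for (col in snp_cols) {
  set(dt_zero, j = col, value = 0)
}
```

Choosing the typed NA up front means later numeric assignments into those columns should not need a type conversion.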
Note that quote and eval are not needed for these.
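As a rough illustration of why the quote version in the question fails, the base R sketch below shows that quote(temp_snp) captures the name temp_snp itself rather than the string it holds, so wrapping it in parentheses does not produce a column name. The variable `q` is a hypothetical stand-in for the question's temp_snpquote.

```r
temp_snp <- "SNP1"

# quote() captures the symbol `temp_snp`, not its value
q <- quote(temp_snp)
is.symbol(q)     # TRUE
is.character(q)  # FALSE

# eval() looks the symbol up and returns the string it refers to
eval(q)          # "SNP1"
```

Since the loop variable already holds the column name as a string, `(temp_snp) := 0` supplies the same value that `eval(temp_snpquote)` does, without the detour.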
Even with this small dataset, the performance difference between set and using := in a loop is measurable (roughly 15x in median time in the benchmark below):
fun1 <- function() { for (i in new) { set(copy1, j = i, value = 0)[] }; copy1 }
fun2 <- function() { for (i in new) { copy2[, (i) := 0][] } ; copy2 }
fun3 <- function() copy3[, (new) := as.list(rep(0, length(new)))][]
bench::mark(fun1(), fun2(), fun3())
## # A tibble: 3 x 13
## expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
## <bch:expr> <bch:t> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
## 1 fun1() 64.9µs 69.63µs 13932. 0B 4.17 6689 2
## 2 fun2() 993µs 1.07ms 910. 377.6KB 4.23 430 2
## 3 fun3() 241.9µs 255.12µs 3793. 16.4KB 4.30 1763 2
## # … with 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
## # time <list>, gc <list>
Upvotes: 2