Reputation: 357
I have a dataframe of logistic regression summary statistics with column names
"CHR" "SNP" "BP" "A1" "TEST" "NMISS" "OR" "STAT" "P"
I want to make a new dataframe which has three columns:
"SNP" "A1"
and "logOR"
The obvious way to do this would be to make a new column, logOR, then simply subset on those 3 columns.
However, I was wondering if it was possible to perform log(OR) within the subsetting process?
I have tried:
raw<-c("SNP","A1","log(OR)")
data.raw<-data[,raw]
And R was not too impressed with this.
Thanks in advance!
Upvotes: 1
Views: 74
Reputation: 47340
Fastest and cleanest way (imo) is to use base function transform
transform(data,logOR = log(OR))[c("SNP","A1","logOR")]
Bonus
There are other ways to do it, I've benchmarked some against each other and give the result for a small and bigger dataset (1000 lines or 100000).
transform
is the fastest in any case. It's a base function that behaves exactly like mutate in this case.
with
makes less sense to me "philosophically" here, but it's the shortest line after data.table
and perform almost on par with transform
when size goes up.
small data.frames
library(microbenchmark)
n <- 1000
data <- data.frame("CHR"=sample(1:n),"SNP"=sample(1:n),"BP"=sample(1:n),"A1"=sample(1:n),"TEST"=sample(1:n),
"NMISS"=sample(1:n),"OR"=sample(1:n),"STAT"=sample(1:n),"P"=sample(1:n))
data2 <- as.data.table(data)
microbenchmark(
transform = transform(data,logOR = log(OR))[c("SNP","A1","logOR")],
within = within (data,logOR <- log(OR))[c("SNP","A1","logOR")],
with = with (data, data.frame(SNP,A1,logOR=log(OR))), # jkt's solution
mutate = mutate(data,logOR = log(OR))[c("SNP","A1","logOR")], # mutate will behave exactly the same as transform in this case
mutate_p = data %>% mutate(logOR = log(OR)) %>% select(SNP, A1, logOR), # same function but with the pipe syntax as formulated by Craig did in the comments
data.table = as.data.table(data)[,logOR := log(OR)][,.(SNP,A1,logOR)], # data.table with conversion
data.table2 = data2[,logOR := log(OR)][,.(SNP,A1,logOR)], # data.table without conversion, this adds logOR to data2 however
times = 1000)
# Unit: microseconds
# expr min lq mean median uq max neval
# transform 202.086 243.4945 281.1694 263.3140 286.6725 6781.367 1000
# within 290.919 353.2080 395.3183 373.5580 397.4480 7039.017 1000
# with 279.948 337.8130 406.2508 361.8790 392.1390 7601.388 1000
# mutate 912.040 1056.2610 1215.2035 1107.4010 1185.4395 8148.541 1000
# mutate_p 1283.297 1516.7040 1741.8224 1584.3020 1710.2950 33254.564 1000
# data.table 938.584 1058.5610 1175.6758 1116.7795 1214.4605 5079.035 1000
# data.table2 819.314 935.5755 1086.9992 993.6175 1084.0425 27160.856 1000
bigger data.frames
n <- 100000
...
# Unit: milliseconds
# expr min lq mean median uq max neval
# transform 3.005094 3.320254 3.978661 3.548707 3.815381 14.87116 1000
# within 3.252126 3.618074 4.542457 3.929165 4.275118 99.77254 1000
# with 3.102066 3.413511 4.229389 3.653466 3.937482 89.80346 1000
# mutate 3.803171 4.221853 4.931597 4.474195 4.815546 26.43214 1000
# mutate_p 4.283788 4.754672 5.622917 4.996396 5.366238 92.74237 1000
# data.table 4.831649 6.336141 9.911754 8.212245 12.283330 102.13386 1000
# data.table2 3.997825 4.749894 6.677897 5.456840 6.125562 116.99369 1000
EDIT: Added data.table solutions
Upvotes: 2
Reputation: 946
Using with
is a good and simple way to do this:
dat.raw <- with(data, data.frame(SNP,A1,log(OR)))
Upvotes: 3