Lynsey
Lynsey

Reputation: 357

Can I manipulate a column whilst subsetting in R?

I have a dataframe of logistic regression summary statistics with column names

"CHR" "SNP" "BP" "A1" "TEST" "NMISS" "OR" "STAT" "P"

I want to make a new dataframe which has three columns:

"SNP" "A1" and "logOR"

The obvious way to do this would be to make a new column, logOR, then simply subset on those 3 columns.

However, I was wondering if it was possible to perform log(OR) within the subsetting process?

I have tried:

raw<-c("SNP","A1","log(OR)")

data.raw<-data[,raw]

And R was not too impressed with this.

Thanks in advance!

Upvotes: 1

Views: 74

Answers (2)

moodymudskipper
moodymudskipper

Reputation: 47340

Fastest and cleanest way (imo) is to use base function transform

transform(data,logOR =  log(OR))[c("SNP","A1","logOR")]

Bonus

There are other ways to do it, I've benchmarked some against each other and give the result for a small and bigger dataset (1000 lines or 100000).

transform is the fastest in any case. It's a base function that behaves exactly like mutate in this case.

with makes less sense to me "philosophically" here, but it's the shortest line after data.table and perform almost on par with transform when size goes up.

small data.frames

library(microbenchmark)    
n <- 1000
data <- data.frame("CHR"=sample(1:n),"SNP"=sample(1:n),"BP"=sample(1:n),"A1"=sample(1:n),"TEST"=sample(1:n),
                   "NMISS"=sample(1:n),"OR"=sample(1:n),"STAT"=sample(1:n),"P"=sample(1:n))
data2 <- as.data.table(data)

microbenchmark(
  transform   = transform(data,logOR =  log(OR))[c("SNP","A1","logOR")],
  within      = within   (data,logOR <- log(OR))[c("SNP","A1","logOR")],
  with        = with     (data, data.frame(SNP,A1,logOR=log(OR))),               # jkt's solution
  mutate      = mutate(data,logOR = log(OR))[c("SNP","A1","logOR")],             # mutate will behave exactly the same as transform in this case
  mutate_p    = data %>% mutate(logOR = log(OR)) %>% select(SNP, A1, logOR),     # same function but with the pipe syntax as formulated by Craig did in the comments
  data.table  = as.data.table(data)[,logOR :=  log(OR)][,.(SNP,A1,logOR)],       # data.table with conversion
  data.table2 = data2[,logOR :=  log(OR)][,.(SNP,A1,logOR)],                     # data.table without conversion, this adds logOR to data2 however
  times = 1000)

# Unit: microseconds
#       expr      min        lq      mean    median        uq       max neval
#   transform  202.086  243.4945  281.1694  263.3140  286.6725  6781.367  1000
#      within  290.919  353.2080  395.3183  373.5580  397.4480  7039.017  1000
#        with  279.948  337.8130  406.2508  361.8790  392.1390  7601.388  1000
#      mutate  912.040 1056.2610 1215.2035 1107.4010 1185.4395  8148.541  1000
#    mutate_p 1283.297 1516.7040 1741.8224 1584.3020 1710.2950 33254.564  1000
#  data.table  938.584 1058.5610 1175.6758 1116.7795 1214.4605  5079.035  1000
# data.table2  819.314  935.5755 1086.9992  993.6175 1084.0425 27160.856  1000

bigger data.frames

n <- 100000
...
# Unit: milliseconds
#        expr      min       lq     mean   median        uq       max neval
#   transform 3.005094 3.320254 3.978661 3.548707  3.815381  14.87116  1000
#      within 3.252126 3.618074 4.542457 3.929165  4.275118  99.77254  1000
#        with 3.102066 3.413511 4.229389 3.653466  3.937482  89.80346  1000
#      mutate 3.803171 4.221853 4.931597 4.474195  4.815546  26.43214  1000
#    mutate_p 4.283788 4.754672 5.622917 4.996396  5.366238  92.74237  1000
#  data.table 4.831649 6.336141 9.911754 8.212245 12.283330 102.13386  1000
# data.table2 3.997825 4.749894 6.677897 5.456840  6.125562 116.99369  1000

EDIT: Added data.table solutions

Upvotes: 2

jkt
jkt

Reputation: 946

Using with is a good and simple way to do this:

dat.raw <- with(data, data.frame(SNP,A1,log(OR)))

Upvotes: 3

Related Questions