spiral01
spiral01

Reputation: 545

R: Avoiding loops when using multiple conditions that are dependent on each other

Not sure my title is as clear as I would like. Below is some code written out using loops:

for(i in 1:length(variantTable[,1])){
  #N stores counts of numbers of population totals of populations that contain the variant in question
  N = 0
  #NF.pop stores frequencies
  NF.EAS = 0
  NF.AMR = 0
  NF.EUR = 0
  NF.SAS = 0
  if(variantTable[i,]$EAS_MAF > 0){
    NF.EAS = EAScount * variantTable[i,]$EAS_MAF
    N = N + EAScount
  }
  if(variantTable[i,]$AMR_MAF > 0){
    NF.AMR = AMRcount * variantTable[i,]$AMR_MAF
    N = N + AMRcount
  }
  if(variantTable[i,]$EUR_MAF > 0){
    NF.EUR = EURcount * variantTable[i,]$EUR_MAF
    N = N + EURcount
  }
  if(variantTable[i,]$SAS_MAF > 0){
    NF.SAS = SAScount * variantTable[i,]$SAS_MAF
    N = N + SAScount
  }
  variantTable[i,]$nonAFR_N <- N

  variantTable[i,]$nonAFR_weighted <- (NF.EAS + NF.AMR + NF.EUR + NF.SAS)/N
}

As you can see variantTable[i,]$nonAFR_weighted is calculated based on conditions across multiple columns (EAS_MAF, AMR_MAF, AFR_MAF, EUR_MAF, SAS_MAF).

I am aware that loops are not the quickest way to do something like this in R, particularly considering the fact that my dataset consists of 900000 rows.

I have just started working with ifelse and apply methods but am unsure how to use them in a situation such as this. I did try to create a function that simply takes in one row and calculates values for that row, and then using the apply method but this hasn't worked as I am not sure what the input should be.

Any advice on how best to proceed with a problem such as this?

EDIT: Here is the dput of my data:

> dput(head(variantTable))
structure(list(CHROM = c("1", "1", "1", "1", "1", "1"), POS = c(69224L, 
69428L, 69486L, 69487L, 69496L, 69521L), ID = c("rs568964432", 
"rs140739101", "rs548369610", "rs568226429", "rs150690004", "rs553724620"
), REF = c("A", "T", "C", "G", "G", "T"), ALT = c("T", "G", "T", 
"A", "A", "A"), AF = c(0.000399361, 0.0189696, 0.000199681, 0.000399361, 
0.000998403, 0.000399361), AC = c(2L, 95L, 1L, 2L, 5L, 2L), AN = c(5008L, 
5008L, 5008L, 5008L, 5008L, 5008L), EAS_AF = c(0, 0.003, 0.001, 
0, 0, 0), AMR_AF = c(0.0029, 0.036, 0, 0, 0.0014, 0.0029), AFR_AF = c(0, 
0.0015, 0, 0.0015, 0.003, 0), EUR_AF = c(0, 0.0497, 0, 0, 0, 
0), SAS_AF = c(0, 0.0153, 0, 0, 0, 0), consequence = c("nonsynonymous SNV", 
"nonsynonymous SNV", "synonymous SNV", "nonsynonymous SNV", "nonsynonymous SNV", 
"nonsynonymous SNV"), gene = c("OR4F5", "OR4F5", "OR4F5", "OR4F5", 
"OR4F5", "OR4F5"), refGene_id = c("NM_001005484", "NM_001005484", 
"NM_001005484", "NM_001005484", "NM_001005484", "NM_001005484"
), AA_change = c("('D', 'V')", "('F', 'C')", "('N', 'N')", "('A', 'T')", 
"('G', 'S')", "('I', 'N')"), X0.fold_count = c(572L, 572L, 572L, 
572L, 572L, 572L), X4.fold_count = c(141L, 141L, 141L, 141L, 
141L, 141L), EAS_MAF = c(0, 0.003, 0.001, 0, 0, 0), AMR_MAF = c(0.0029, 
0.036, 0, 0, 0.0014, 0.0029), AFR_MAF = c(0, 0.0015, 0, 0.0015, 
0.003, 0), EUR_MAF = c(0, 0.0497, 0, 0, 0, 0), SAS_MAF = c(0, 
0.0153, 0, 0, 0, 0), nonAFR_AF = c(0.0029, 0.104, 0.001, 0, 0.0014, 
0.0029), nonAFR_N = c(309227, 1128036, 262551, 0, 309227, 309227
), nonAFR_weighted = c(0.0029, 0.0261704282487438, 0.001, NaN, 
0.0014, 0.0029)), .Names = c("CHROM", "POS", "ID", "REF", "ALT", 
"AF", "AC", "AN", "EAS_AF", "AMR_AF", "AFR_AF", "EUR_AF", "SAS_AF", 
"consequence", "gene", "refGene_id", "AA_change", "X0.fold_count", 
"X4.fold_count", "EAS_MAF", "AMR_MAF", "AFR_MAF", "EUR_MAF", 
"SAS_MAF", "nonAFR_AF", "nonAFR_N", "nonAFR_weighted"), row.names = c(NA, 
6L), class = "data.frame")

the populations counts (EAScount, AMRcount etc) have been defined previously as such:

EAScount <- length(variantTable$EAS_MAF[variantTable$EAS_MAF>0])
AMRcount <- length(variantTable$EAS_MAF[variantTable$AMR_MAF>0])
AFRcount <- length(variantTable$EAS_MAF[variantTable$AFR_MAF>0])
EURcount <- length(variantTable$EAS_MAF[variantTable$EUR_MAF>0])
SAScount <- length(variantTable$EAS_MAF[variantTable$SAS_MAF>0])

The output I am looking for is the calculation for variantTable$nonAFR_n and variantTable$nonAFR_weighted. An example is below with the correct calculation:

> variantTable[2,]
  CHROM   POS          ID REF ALT        AF AC   AN EAS_AF AMR_AF AFR_AF EUR_AF SAS_AF       consequence
2     1 69428 rs140739101   T   G 0.0189696 95 5008  0.003  0.036 0.0015 0.0497 0.0153 nonsynonymous SNV
   gene   refGene_id  AA_change X0.fold_count X4.fold_count EAS_MAF AMR_MAF AFR_MAF EUR_MAF SAS_MAF
2 OR4F5 NM_001005484 ('F', 'C')           572           141   0.003   0.036  0.0015  0.0497  0.0153
  nonAFR_AF nonAFR_N nonAFR_weighted
2     0.104  1128036      0.02617043

Upvotes: 1

Views: 68

Answers (1)

moodymudskipper
moodymudskipper

Reputation: 47350

would this work ?

library(dplyr)
variantTable %>% mutate(
  NF.EAS = EAScount * EAS_MAF,
  NF.AMR = AMRcount * AMR_MAF,
  NF.EUR = EURcount * EUR_MAF,
  NF.SAS = SAScount * SAS_MAF,
  nonAFR_N = EAScount * (EAS_MAF>0) + AMRcount * (AMR_MAF>0) + EURcount * (EUR_MAF>0) + SAScount * (SAS_MAF>0),
  nonAFR_weighted = (NF.EAS + NF.AMR + NF.EUR + NF.SAS)/nonAFR_N) %>%
  select(-c(NF.EAS,NF.AMR,NF.EUR,NF.SAS))

mutate adds or modify columns to the table, it allows the use of column names without the $ notation. Your if structure wasn't necessary because by multiplying by zero you get your default zero value anyway, so the first part of your script could be simplified.

Booleans are coerced into integers 0 and 1 when used with arithmetic operators, so I didn't need all the if structure to compute N either.

And the last column is quite straightforward.

All these operations were vectorized, meaning I added, subtracted, multiplied, divided columns directly rather than discrete values, it's faster for the machine and easier on the eyes.

Also, more efficient and easier to read:

EAScount <- sum(variantTable$EAS_MAF>0)
AMRcount <- sum(variantTable$AMR_MAF>0)
AFRcount <- sum(variantTable$AFR_MAF>0)
EURcount <- sum(variantTable$EUR_MAF>0)
SAScount <- sum(variantTable$SAS_MAF>0)

Upvotes: 4

Related Questions