Reputation: 545
Not sure my title is as clear as I would like. Below is some code written out using loops:
for(i in 1:length(variantTable[,1])){
#N stores counts of numbers of population totals of populations that contain the variant in question
N = 0
#NF.pop stores frequencies
NF.EAS = 0
NF.AMR = 0
NF.EUR = 0
NF.SAS = 0
if(variantTable[i,]$EAS_MAF > 0){
NF.EAS = EAScount * variantTable[i,]$EAS_MAF
N = N + EAScount
}
if(variantTable[i,]$AMR_MAF > 0){
NF.AMR = AMRcount * variantTable[i,]$AMR_MAF
N = N + AMRcount
}
if(variantTable[i,]$EUR_MAF > 0){
NF.EUR = EURcount * variantTable[i,]$EUR_MAF
N = N + EURcount
}
if(variantTable[i,]$SAS_MAF > 0){
NF.SAS = SAScount * variantTable[i,]$SAS_MAF
N = N + SAScount
}
variantTable[i,]$nonAFR_N <- N
variantTable[i,]$nonAFR_weighted <- (NF.EAS + NF.AMR + NF.EUR + NF.SAS)/N
}
As you can see variantTable[i,]$nonAFR_weighted
is calculated based on conditions across multiple columns (EAS_MAF, AMR_MAF, AFR_MAF, EUR_MAF, SAS_MAF).
I am aware that loops are not the quickest way to do something like this in R, particularly considering the fact that my dataset consists of 900000 rows.
I have just started working with ifelse and apply methods but am unsure how to use them in a situation such as this. I did try to create a function that simply takes in one row and calculates values for that row, and then using the apply method but this hasn't worked as I am not sure what the input should be.
Any advice on how best to proceed with a problem such as this?
EDIT: Here is the dput of my data:
> dput(head(variantTable))
structure(list(CHROM = c("1", "1", "1", "1", "1", "1"), POS = c(69224L,
69428L, 69486L, 69487L, 69496L, 69521L), ID = c("rs568964432",
"rs140739101", "rs548369610", "rs568226429", "rs150690004", "rs553724620"
), REF = c("A", "T", "C", "G", "G", "T"), ALT = c("T", "G", "T",
"A", "A", "A"), AF = c(0.000399361, 0.0189696, 0.000199681, 0.000399361,
0.000998403, 0.000399361), AC = c(2L, 95L, 1L, 2L, 5L, 2L), AN = c(5008L,
5008L, 5008L, 5008L, 5008L, 5008L), EAS_AF = c(0, 0.003, 0.001,
0, 0, 0), AMR_AF = c(0.0029, 0.036, 0, 0, 0.0014, 0.0029), AFR_AF = c(0,
0.0015, 0, 0.0015, 0.003, 0), EUR_AF = c(0, 0.0497, 0, 0, 0,
0), SAS_AF = c(0, 0.0153, 0, 0, 0, 0), consequence = c("nonsynonymous SNV",
"nonsynonymous SNV", "synonymous SNV", "nonsynonymous SNV", "nonsynonymous SNV",
"nonsynonymous SNV"), gene = c("OR4F5", "OR4F5", "OR4F5", "OR4F5",
"OR4F5", "OR4F5"), refGene_id = c("NM_001005484", "NM_001005484",
"NM_001005484", "NM_001005484", "NM_001005484", "NM_001005484"
), AA_change = c("('D', 'V')", "('F', 'C')", "('N', 'N')", "('A', 'T')",
"('G', 'S')", "('I', 'N')"), X0.fold_count = c(572L, 572L, 572L,
572L, 572L, 572L), X4.fold_count = c(141L, 141L, 141L, 141L,
141L, 141L), EAS_MAF = c(0, 0.003, 0.001, 0, 0, 0), AMR_MAF = c(0.0029,
0.036, 0, 0, 0.0014, 0.0029), AFR_MAF = c(0, 0.0015, 0, 0.0015,
0.003, 0), EUR_MAF = c(0, 0.0497, 0, 0, 0, 0), SAS_MAF = c(0,
0.0153, 0, 0, 0, 0), nonAFR_AF = c(0.0029, 0.104, 0.001, 0, 0.0014,
0.0029), nonAFR_N = c(309227, 1128036, 262551, 0, 309227, 309227
), nonAFR_weighted = c(0.0029, 0.0261704282487438, 0.001, NaN,
0.0014, 0.0029)), .Names = c("CHROM", "POS", "ID", "REF", "ALT",
"AF", "AC", "AN", "EAS_AF", "AMR_AF", "AFR_AF", "EUR_AF", "SAS_AF",
"consequence", "gene", "refGene_id", "AA_change", "X0.fold_count",
"X4.fold_count", "EAS_MAF", "AMR_MAF", "AFR_MAF", "EUR_MAF",
"SAS_MAF", "nonAFR_AF", "nonAFR_N", "nonAFR_weighted"), row.names = c(NA,
6L), class = "data.frame")
the populations counts (EAScount, AMRcount etc) have been defined previously as such:
EAScount <- length(variantTable$EAS_MAF[variantTable$EAS_MAF>0])
AMRcount <- length(variantTable$EAS_MAF[variantTable$AMR_MAF>0])
AFRcount <- length(variantTable$EAS_MAF[variantTable$AFR_MAF>0])
EURcount <- length(variantTable$EAS_MAF[variantTable$EUR_MAF>0])
SAScount <- length(variantTable$EAS_MAF[variantTable$SAS_MAF>0])
The output I am looking for is the calculation for variantTable$nonAFR_n and variantTable$nonAFR_weighted. An example is below with the correct calculation:
> variantTable[2,]
CHROM POS ID REF ALT AF AC AN EAS_AF AMR_AF AFR_AF EUR_AF SAS_AF consequence
2 1 69428 rs140739101 T G 0.0189696 95 5008 0.003 0.036 0.0015 0.0497 0.0153 nonsynonymous SNV
gene refGene_id AA_change X0.fold_count X4.fold_count EAS_MAF AMR_MAF AFR_MAF EUR_MAF SAS_MAF
2 OR4F5 NM_001005484 ('F', 'C') 572 141 0.003 0.036 0.0015 0.0497 0.0153
nonAFR_AF nonAFR_N nonAFR_weighted
2 0.104 1128036 0.02617043
Upvotes: 1
Views: 68
Reputation: 47350
would this work ?
library(dplyr)
variantTable %>% mutate(
NF.EAS = EAScount * EAS_MAF,
NF.AMR = AMRcount * AMR_MAF,
NF.EUR = EURcount * EUR_MAF,
NF.SAS = SAScount * SAS_MAF,
nonAFR_N = EAScount * (EAS_MAF>0) + AMRcount * (AMR_MAF>0) + EURcount * (EUR_MAF>0) + SAScount * (SAS_MAF>0),
nonAFR_weighted = (NF.EAS + NF.AMR + NF.EUR + NF.SAS)/nonAFR_N) %>%
select(-c(NF.EAS,NF.AMR,NF.EUR,NF.SAS))
mutate adds or modify columns to the table, it allows the use of column names without the $
notation. Your if
structure wasn't necessary because by multiplying by zero you get your default zero value anyway, so the first part of your script could be simplified.
Booleans are coerced into integers 0
and 1
when used with arithmetic operators, so I didn't need all the if
structure to compute N
either.
And the last column is quite straightforward.
All these operations were vectorized, meaning I added, subtracted, multiplied, divided columns directly rather than discrete values, it's faster for the machine and easier on the eyes.
Also, more efficient and easier to read:
EAScount <- sum(variantTable$EAS_MAF>0)
AMRcount <- sum(variantTable$AMR_MAF>0)
AFRcount <- sum(variantTable$AFR_MAF>0)
EURcount <- sum(variantTable$EUR_MAF>0)
SAScount <- sum(variantTable$SAS_MAF>0)
Upvotes: 4