Reputation: 23
I have ~11,000,000 rows in a data frame. I need to loop through each row, do a small calculation, and then retrieve the corresponding p-value from a chi-squared distribution using pchisq(). Each time a value is retrieved, it is appended to an initially empty vector, which is later added to the data frame.
This code is very inefficient and took exactly a week to run on the server. I believe that is because append() has to copy the whole vector on every iteration. How can I make this as efficient as possible?
Here is the current loop:
std_err <- NULL
for (i in 1:nrow(father)) {
  std_err <- append(std_err, pchisq((mother[i,7]-father[i,7])^2/((mother[i,8])^2 + (father[i,8])^2), df=1, lower.tail = F))
}
father[ ,"p_std_err"] <- std_err
write.table(father, "father+standard_error.sumstats", sep = '\t', col.names = T, row.names = F, quote = F)
Upvotes: 0
Views: 442
Reputation: 11878
pchisq() is vectorized, so you don't need a loop at all. You can just write:
pchisq((mother[, 7] - father[, 7])^2 / (mother[, 8]^2 + father[, 8]^2), df = 1, lower.tail = FALSE)
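For a quick sanity check, here is a minimal sketch of how the vectorized call could be assigned straight to the new column and compared with a row-by-row loop. The toy data frames, the helper make_df(), and the column names beta and se are made up for illustration; the only thing taken from the question is that columns 7 and 8 hold the two values used in the calculation.

```r
# Toy stand-ins for the real data frames (hypothetical data).
# Columns 7 and 8 play the same role as in the question.
set.seed(1)
n <- 1000
make_df <- function(n) {
  df <- as.data.frame(matrix(rnorm(n * 6), nrow = n))  # columns 1-6: filler
  df$beta <- rnorm(n)                                  # column 7
  df$se   <- runif(n, 0.1, 1)                          # column 8
  df
}
mother <- make_df(n)
father <- make_df(n)

# One vectorized call over whole columns: no loop, no append()
stat <- (mother[, 7] - father[, 7])^2 / (mother[, 8]^2 + father[, 8]^2)
father[, "p_std_err"] <- pchisq(stat, df = 1, lower.tail = FALSE)

# Row-by-row version with a preallocated vector, used only to confirm
# that both approaches give the same p-values
loop_p <- numeric(n)
for (i in seq_len(n)) {
  loop_p[i] <- pchisq((mother[i, 7] - father[i, 7])^2 /
                        (mother[i, 8]^2 + father[i, 8]^2),
                      df = 1, lower.tail = FALSE)
}
stopifnot(isTRUE(all.equal(loop_p, father$p_std_err)))
```

If a loop were ever unavoidable, preallocating the result with numeric(n) as above avoids the repeated copying that append() causes, but here the vectorized form makes the loop unnecessary.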
Upvotes: 6