Reputation: 3
I have a table in which I am trying to normalize one column by the mean of a subset of that same column, where the subset is chosen by the values in two other columns. Ideally, my code would divide every coverage_depth value for a given strain (like 2987) by the mean coverage_depth of that strain's SAG1 rows (rows where chr is SAG1 and strains is 2987).
I have found the long way of doing this but I'm really hoping someone has a way to make this a loop so that I don't have to input means by hand after they are calculated.
My table looks like this:
B1 1073 320 2987
B1 1074 324 2987
B1 1075 330 2987
SAG1 955 31 2987
SAG1 956 30 2987
SAG1 957 29 2987
SAG1 958 29 2987
BTub 446 57 2987
BTub 452 59 2987
B1 1707 53 GRE_MIG
B1 1708 56 GRE_MIG
18S 1099 242 GRE_MIG
18S 1100 242 GRE_MIG
SAG1 888 7 GRE_MIG
SAG1 889 7 GRE_MIG
SAG1 890 7 GRE_MIG
First I load in my table:
reads<-read.table("3133_all.CNV.txt", sep = "\t", header = F)
colnames(reads)<-c("chr", "position", "coverage_depth", "strains")
Then I use plyr to calculate the mean coverage_depth for every combination of the chr and strains columns:
library(plyr)
coverage_summary<-ddply(reads, c("chr", "strains"), summarise, mean = mean(coverage_depth))
write.csv(format(coverage_summary, scientific=FALSE), file = "CNV_mean_07.27.16.csv", row.names = F)
Which gives me a longer version of this:
chr strains mean
1 18S 2987 2.052802e+03
20 18S GRE_MIG 2.674536e+01
126 B1 GRE_MIG 6.503342e+01
213 SAG1 2987 3.422057e+01
232 SAG1 GRE_MIG 5.863501e+00
I figured out how to normalize all of a strain's coverage_depth values by that strain's mean at chr SAG1, but only by typing the means in manually, like so:
NormalizeSAG1<-function(coverage_depth, strains){
  # Divide by the strain's SAG1 mean, copied by hand from coverage_summary
  if (strains %in% c("2987")) {
    coverage_depth/34.22
  } else if (strains %in% c("GRE_MIG")) {
    coverage_depth/5.86
  } else {
    # Strains without a hard-coded mean are returned unchanged
    coverage_depth
  }
}
reads$SAG1_normalized<-mapply(NormalizeSAG1, reads$coverage_depth, reads$strains)
The problem is that I have 53 different strains, and I want to normalize each one by its own mean at SAG1 in the chr column. It seems like a for loop might do it, but I can't figure out how to subset my data properly so that I can normalize without a ton of ifelse statements.
Upvotes: 0
Views: 815
Reputation: 11957
Try the following:
reads <- merge(reads, coverage_summary)
reads <- mutate(reads, normalized = coverage_depth / mean)
This joins the summary means back onto your raw data (merge() matches on the columns the two data frames share, here chr and strains), after which creating a normalized column is a single division. It also avoids having to write a custom function that accounts for 53 different possible values.
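One thing to watch: because the merge above matches on both chr and strains, each row ends up divided by the mean of its own chr/strain group. If the goal is to divide every row of a strain by that strain's SAG1 mean, as the question describes, the summary probably needs to be restricted to SAG1 and joined on strains alone. Below is a minimal base R sketch of that idea. The toy data frame is a handful of rows copied from the question's sample table, and the names sag1_means and sag1_mean are made up for illustration.

```r
# Toy data: a few rows copied from the sample table in the question
reads <- data.frame(
  chr            = c("B1", "B1", "SAG1", "SAG1", "B1", "SAG1", "SAG1"),
  position       = c(1073, 1074, 955, 957, 1707, 888, 889),
  coverage_depth = c(320, 324, 31, 29, 53, 7, 7),
  strains        = c("2987", "2987", "2987", "2987",
                     "GRE_MIG", "GRE_MIG", "GRE_MIG"),
  stringsAsFactors = FALSE
)

# Mean coverage at SAG1 only, one row per strain
sag1_means <- aggregate(coverage_depth ~ strains,
                        data = reads[reads$chr == "SAG1", ],
                        FUN = mean)
# Rename so the column does not clash with coverage_depth in reads
names(sag1_means)[names(sag1_means) == "coverage_depth"] <- "sag1_mean"

# Join on strains alone, so every row of a strain gets that strain's SAG1 mean
reads <- merge(reads, sag1_means, by = "strains", all.x = TRUE)

# A strain with no SAG1 rows gets NA here rather than an unnormalized value
reads$SAG1_normalized <- reads$coverage_depth / reads$sag1_mean
```

With the full data set, the same three steps should cover all 53 strains without any hard-coded means. Note that merge() may reorder the rows, so sort afterwards if the original order matters.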
Upvotes: 1