Xenarat

Reputation: 3

Normalizing data based on Subset Mean in R

I have a table that I am trying to normalize by subset means: the values in one column should be divided by a mean that depends on the variable in another column. Ideally, my code would divide all of the coverage_depth values for a given strain (like 2987) by the mean of a subset of that same column (coverage_depth for only the SAG1 rows in the chr column, and only for that strain in the strains column).

I have found the long way of doing this but I'm really hoping someone has a way to make this a loop so that I don't have to input means by hand after they are calculated.

My table looks like this:

B1  1073    320 2987
B1  1074    324 2987
B1  1075    330 2987
SAG1    955 31  2987
SAG1    956 30  2987
SAG1    957 29  2987
SAG1    958 29  2987
BTub    446 57  2987
BTub    452 59  2987
B1  1707    53  GRE_MIG
B1  1708    56  GRE_MIG
18S 1099    242 GRE_MIG
18S 1100    242 GRE_MIG
SAG1    888 7   GRE_MIG
SAG1    889 7   GRE_MIG
SAG1    890 7   GRE_MIG

First I load in my table:

reads<-read.table("3133_all.CNV.txt", sep = "\t", header = F)
colnames(reads)<-c("chr", "position", "coverage_depth", "strains")

Then I call plyr to calculate the mean of coverage_depth of all the combinations of the chr and strains columns

library(plyr)
coverage_summary<-ddply(reads, c("chr", "strains"), summarise, mean = mean(coverage_depth))
write.csv(format(coverage_summary, scientific=FALSE), file = "CNV_mean_07.27.16.csv", row.names = F)

Which gives me a longer version of this:

     chr    strains         mean
1    18S       2987 2.052802e+03
20   18S    GRE_MIG 2.674536e+01
126   B1    GRE_MIG 6.503342e+01
213 SAG1       2987 3.422057e+01
232 SAG1    GRE_MIG 5.863501e+00

I figured out how to normalize all of the coverage_depth values of a strain by that strain's mean at chr SAG1, but only by typing the means in manually, like so:

NormalizeSAG1<-function(coverage_depth, strains){
  # Divide by the hand-entered SAG1 mean for this strain
  if (strains %in% c("2987")) {
    coverage_depth/34.22   # 3.422057e+01 in the summary above
  } else if (strains %in% c("GRE_MIG")) {
    coverage_depth/5.86
  } else {
    # Strains without a hard-coded mean are returned unchanged
    coverage_depth
  }
}
reads$SAG1_normalized<-mapply(NormalizeSAG1, reads$coverage_depth, reads$strains)

The problem is that I have 53 different strains that I want to normalize based on the mean at their individual SAG1 in the chr column. It seems like maybe a for loop would do it but I can't figure out how to properly subset my data to normalize without a ton of ifelse statements.

Upvotes: 0

Views: 815

Answers (1)

jdobres

Reputation: 11957

Try the following:

reads <- merge(reads, coverage_summary)
reads <- mutate(reads, normalized = coverage_depth / mean)

By default, merge joins on the columns the two data frames share (here chr and strains), so this adds the matching mean from your summary to every row of your raw data. After that, creating a normalized column is a single division. This also avoids having to write a custom function that accounts for 53 different possible values.
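One caveat: merging on both chr and strains divides each row by the mean of its own chr/strain group. If the goal is to divide every row of a strain by that strain's SAG1 mean, as the question describes, the summary probably needs to be restricted to SAG1 and joined on strains alone. Here is a minimal sketch of that idea in base R. The toy data frame (values borrowed from the sample table) and the names sag1, sag1_means and sag1_mean are only for illustration and are not from the original post.

```r
# Toy data with the same columns as in the question (illustrative values only)
reads <- data.frame(
  chr = c("B1", "B1", "SAG1", "SAG1", "B1", "SAG1", "SAG1"),
  position = c(1073, 1074, 955, 956, 1707, 888, 889),
  coverage_depth = c(320, 324, 31, 29, 53, 7, 7),
  strains = c("2987", "2987", "2987", "2987", "GRE_MIG", "GRE_MIG", "GRE_MIG"),
  stringsAsFactors = FALSE
)

# Mean SAG1 coverage per strain: one row per strain
sag1 <- reads[reads$chr == "SAG1", ]
sag1_means <- aggregate(coverage_depth ~ strains, data = sag1, FUN = mean)
names(sag1_means)[names(sag1_means) == "coverage_depth"] <- "sag1_mean"

# Join on strain only, so every row of a strain receives that strain's SAG1 mean.
# all.x = TRUE keeps strains that have no SAG1 rows (their sag1_mean will be NA).
reads <- merge(reads, sag1_means, by = "strains", all.x = TRUE)

# Normalize every coverage value by its strain's SAG1 mean
reads$SAG1_normalized <- reads$coverage_depth / reads$sag1_mean
```

Note that merge may reorder the rows, and any strain without SAG1 rows ends up with NA in SAG1_normalized rather than being silently left unnormalized.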

Upvotes: 1
