Reputation: 107
I'm trying to transform the counts in each cell of the table below (filename=abundance_table) to percent relative abundance by dividing each count by the total sum of counts in its column (for every column except the first) and then multiplying by 100.
Taxa Sample1 Sample2
Eukaryota;Alveolata;Apicomplexa 1000 500
Eukaryota;Alveolata;Dinophyceae 2000 500
Eukaryota;Alveolata;Unclassified Alveolata 500 1000
Eukaryota;Choanoflagellida;Acanthoecidae 500 1000
Eukaryota;Choanoflagellida;Codonosigidae 1000 2000
and I'm expecting an output table that would look exactly as below:
Taxa Sample1 Sample2
Eukaryota;Alveolata;Apicomplexa 20 10
Eukaryota;Alveolata;Dinophyceae 40 10
Eukaryota;Alveolata;Unclassified Alveolata 10 20
Eukaryota;Choanoflagellida;Acanthoecidae 10 20
Eukaryota;Choanoflagellida;Codonosigidae 20 40
I'm new to R and I tried the below R code but it didn't give me the expected result. I would appreciate it very much if anyone could provide me the correct R code to do this or if there's an alternative simple solution on bash for this.
df <- read.table("abundance_table", header= TRUE, sep = "\t")
sum= colSums(df[,-1])
norm = df[,-1] / sum*100
Upvotes: 0
Views: 1897
Reputation: 12887
Using awk as an alternative, processing the file twice:
awk 'NR==FNR { tot1+=$(NF-1);tot2+=$NF;next } NR!=FNR && FNR == 1 { print } NR!=FNR && FNR != 1 { for (i=1;i<NF-1;i++) { printf "%s ",$i } printf "%s %s\n",($(NF-1)/tot1)*100,($(NF)/tot2)*100 }' file file
Explanation:
awk 'NR==FNR { # On the first process of the file
tot1+=$(NF-1); # Create a variable with a running total of the last but one field
tot2+=$NF; # Create a variable with a running total of the last field
next
}
NR!=FNR && FNR == 1 { # Second pass of the file, where the line/record is the first (header) line
print # Print the line (headers)
}
NR!=FNR && FNR != 1 { # Second pass of the file, non-header lines
for (i=1;i<NF-1;i++) {
printf "%s ",$i # Loop through the leading text fields, printing each one
}
printf "%s %s\n",($(NF-1)/tot1)*100,($(NF)/tot2)*100 # Print the calculated fields (utilising totals)
}' file file
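The one-liner above hard-codes two sample columns. As a rough sketch of the same two-pass idea for any number of sample columns, the totals can be kept in an array indexed by column number. The function name `rel_abundance` is hypothetical. The sketch assumes the file is tab-separated (as the `sep = "\t"` in the question suggests) with a single header line, so that spaces inside a taxon name stay in the first field.

```shell
# Hypothetical helper: percent relative abundance for every column after the first.
# Assumption: tab-separated input with exactly one header line.
rel_abundance() {
    awk 'BEGIN { FS = OFS = "\t" }
         NR == FNR {                      # first pass: accumulate per-column totals
             if (FNR > 1)
                 for (i = 2; i <= NF; i++) tot[i] += $i
             next
         }
         FNR == 1 { print; next }         # second pass: copy the header unchanged
         {
             for (i = 2; i <= NF; i++)    # replace each count by its percentage
                 $i = (tot[i] ? 100 * $i / tot[i] : 0)
             print
         }' "$1" "$1"
}

# Usage (abundance_table is the file name from the question):
# rel_abundance abundance_table > relative_abundance_table
```

Columns whose total is zero are reported as 0 rather than triggering a division-by-zero error; that choice is an assumption and may not suit every dataset.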
Upvotes: 1
Reputation: 389175
Here are three base R solutions:
#1.
df[-1] <-sweep(df[-1], 2, colSums(df[,-1]), `/`) * 100
#2.
df[-1] <- t(t(df[-1])/colSums(df[,-1])) * 100
#3.
df[-1] <- sapply(df[-1], prop.table) * 100
All of them return:
df
# Taxa Sample1 Sample2
#1 Eukaryota;Alveolata;Apicomplexa 20 10
#2 Eukaryota;Alveolata;Dinophyceae 40 10
#3 Eukaryota;Alveolata;UnclassifiedAlveolata 10 20
#4 Eukaryota;Choanoflagellida;Acanthoecidae 10 20
#5 Eukaryota;Choanoflagellida;Codonosigidae 20 40
Upvotes: 1