Rodriguez J Mathew

Reputation: 107

How to transform raw counts in a table to percent relative abundance on R or bash?

I'm trying to transform the counts in each cell of the table below (filename=abundance_table) to percent relative abundance by dividing each count by the total sum of counts in its column (every column except the first) and then multiplying by 100.

Taxa                                        Sample1  Sample2
Eukaryota;Alveolata;Apicomplexa             1000     500
Eukaryota;Alveolata;Dinophyceae             2000     500
Eukaryota;Alveolata;Unclassified Alveolata  500      1000
Eukaryota;Choanoflagellida;Acanthoecidae    500      1000
Eukaryota;Choanoflagellida;Codonosigidae    1000     2000

and I'm expecting an output table that would look exactly as below:

Taxa                                        Sample1  Sample2
Eukaryota;Alveolata;Apicomplexa             20       10
Eukaryota;Alveolata;Dinophyceae             40       10
Eukaryota;Alveolata;Unclassified Alveolata  10       20
Eukaryota;Choanoflagellida;Acanthoecidae    10       20
Eukaryota;Choanoflagellida;Codonosigidae    20       40

I'm new to R and tried the R code below, but it didn't give me the expected result. I would very much appreciate the correct R code for this, or a simple alternative solution in bash.

df <- read.table("abundance_table", header= TRUE, sep = "\t")
sum= colSums(df[,-1])
norm = df[,-1] / sum*100

Upvotes: 0

Views: 1897

Answers (2)

Raman Sailopal

Reputation: 12887

Using awk as an alternative, processing the file twice:

awk 'NR==FNR { tot1+=$(NF-1);tot2+=$NF;next } NR!=FNR && FNR == 1 { print } NR!=FNR && FNR != 1 { for (i=1;i<NF-1;i++) { printf "%s ",$i } printf "%s %s\n",($(NF-1)/tot1)*100,($(NF)/tot2)*100 }' abundance_table abundance_table

Explanation:

awk 'NR==FNR { # First pass over the file
              tot1+=$(NF-1); # Running total of the second-to-last field (header text counts as 0)
              tot2+=$NF; # Running total of the last field
              next # Skip the remaining rules during the first pass
             } 
     NR!=FNR && FNR == 1 { # Second pass, first line/record (the header)
              print # Print the header line unchanged
             }
     NR!=FNR && FNR != 1 { # Second pass, non-header lines
              for (i=1;i<NF-1;i++) { 
                printf "%s ",$i # Print the text fields that precede the counts
              } 
              printf "%s %s\n",($(NF-1)/tot1)*100,($(NF)/tot2)*100 # Print each count as a percentage of its column total
             }' abundance_table abundance_table
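
The command above is tied to exactly two sample columns. If the table has more, the same two-pass idea can be generalised with an array of per-column totals. This is only a sketch, assuming the file is tab-separated (as the read.table call in the question suggests) and has a single header line; the function name rel_abundance is made up for illustration.

```shell
# Hypothetical helper: rewrite every column after the first as a percentage
# of that column's total. Assumes tab-separated input with one header line.
rel_abundance() {
  awk 'BEGIN { FS = OFS = "\t" }
       NR == FNR {                      # first pass: accumulate column totals
         if (FNR > 1)
           for (i = 2; i <= NF; i++) tot[i] += $i
         next
       }
       FNR == 1 { print; next }         # second pass: header is printed unchanged
       {                                # second pass: data rows
         for (i = 2; i <= NF; i++)
           $i = (tot[i] ? 100 * $i / tot[i] : 0)   # guard against an all-zero column
         print
       }' "$1" "$1"
}
```

It would be called as rel_abundance abundance_table, with the output redirected to a new file if needed. Because the field separator is a tab, a taxon name containing a space (such as "Unclassified Alveolata") stays in one field.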

Upvotes: 1

Ronak Shah

Reputation: 389175

Here are three base R solutions:

#1.
df[-1] <- sweep(df[-1], 2, colSums(df[-1]), `/`) * 100

#2.
df[-1] <- t(t(df[-1]) / colSums(df[-1])) * 100

#3.
df[-1] <- sapply(df[-1], prop.table) * 100

All three return:

df
#                                       Taxa Sample1 Sample2
#1           Eukaryota;Alveolata;Apicomplexa      20      10
#2           Eukaryota;Alveolata;Dinophyceae      40      10
#3 Eukaryota;Alveolata;UnclassifiedAlveolata      10      20
#4  Eukaryota;Choanoflagellida;Acanthoecidae      10      20
#5  Eukaryota;Choanoflagellida;Codonosigidae      20      40

Upvotes: 1
