Using awk to count how many times each species ID occurs in multi-FASTA files

Question

I searched for this topic but could not find an answer. I have 5593 multi-FASTA files, and I need to count how many times each species ID occurs in each file. I can get the total number of sequences for each species across all files, but I can't break the counts down by input file.

Input

file1.fasta:

>hsa
ATCGATCGATCAGACTACG

>eco
ATCGATCGATCAGACTACG

file2.fasta:

>hsa
GATCGATCAGACTACGAAA

>hsa
GATCGATCACAGACTACGAAA

file3.fasta:

>hsa
CTAGACTAGATAGACACATAGAGA

>ecj
CTAGACTAGCTAGACCCATAGAGA

>mmu
CTAGACAAGATAGACACAAAGAGA

>eco
CTAGACTACATCGACACATAGAGA

Expected output

file1.fasta >hsa [count]
file1.fasta >eco [count]
file2.fasta >hsa [count]
file3.fasta >hsa [count]
file3.fasta >ecj [count]
file3.fasta >mmu [count]
file3.fasta >eco [count]

What I tried

awk '/^>.../ {print $1}' *.* | sort | uniq -c | sort -nr

Output

[total counts]>hsa
[total counts]>eco
[total counts]>mmu
[total counts]>ecj

Jonathan Leffler · Accepted Answer

Assuming that the 'species' lines start with >, and that the square-bracketed expressions in the sample output are simple numbers, then:

awk 'BEGIN { SUBSEP = " " }
     /^>/ { per_file[FILENAME,$1]++; total[$1]++ }
     END { for (k in per_file) print k, per_file[k]
           for (k in total)    print total[k], k
         }' *.fasta

You'll probably need to do some sorting somewhere along the line, either in awk or afterwards, as there's no guarantee that the data presented by a for (index in array) loop will be in any particular order.
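One way to do that sorting afterwards is to pipe the per-file counts through sort. This is only a sketch: the temporary directory, the shortened sample sequences, and the choice of sort keys (file name, then count descending, then species id) are assumptions for illustration, not part of the question.

```shell
# Hypothetical sample data mirroring the three input files in the question
# (sequences shortened; only the header lines matter for counting).
tmpdir=$(mktemp -d)
printf '>hsa\nATCG\n>eco\nATCG\n' > "$tmpdir/file1.fasta"
printf '>hsa\nGATC\n>hsa\nGATC\n' > "$tmpdir/file2.fasta"
printf '>hsa\nCTAG\n>ecj\nCTAG\n>mmu\nCTAG\n>eco\nCTAG\n' > "$tmpdir/file3.fasta"

# Count per file and species in awk, then impose an order with sort:
# field 1 = file name, field 3 = count (numeric, descending), field 2 = species id.
sorted_counts=$(cd "$tmpdir" && awk 'BEGIN { SUBSEP = " " }
     /^>/ { per_file[FILENAME,$1]++ }
     END { for (k in per_file) print k, per_file[k] }' *.fasta |
  LC_ALL=C sort -k1,1 -k3,3nr -k2,2)

printf '%s\n' "$sorted_counts"
rm -rf "$tmpdir"
```

With this sample data the listing starts with the two file1.fasta lines, followed by "file2.fasta >hsa 2" and then the four file3.fasta lines. The field-based sort keys rely on file names that contain no blanks.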

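awk builds the key for per_file[FILENAME,$1] by joining the two subscripts with the value of SUBSEP, which defaults to the \034 character. Without the BEGIN block (or some other mechanism) setting SUBSEP, that \034 character would appear after the filename and before the species key in the output. By setting SUBSEP to a blank, the filename is separated from the species key by a blank.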
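If file names might themselves contain blanks, an alternative is to leave SUBSEP at its default and split the combined key when printing. The following is a minimal sketch under that assumption; the file name "my file.fasta", the array name part, and the tab-separated output format are made up for illustration.

```shell
# Hypothetical input whose file name contains a blank.
tmpdir=$(mktemp -d)
printf '>hsa\nATCG\n>hsa\nGATC\n' > "$tmpdir/my file.fasta"

# SUBSEP keeps its default value (\034); split() recovers the two subscripts.
split_out=$(cd "$tmpdir" && awk '
     /^>/ { per_file[FILENAME,$1]++ }
     END { for (k in per_file) {
             split(k, part, SUBSEP)   # part[1] = file name, part[2] = species id
             printf "%s\t%s\t%d\n", part[1], part[2], per_file[k]
           }
         }' *.fasta)

printf '%s\n' "$split_out"
rm -rf "$tmpdir"
```

Here the single output line holds the file name, the species id >hsa, and the count 2, separated by tabs, so the blank inside the file name stays unambiguous.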
