juk4
juk4

Reputation: 3

Using awk to count how many time each species id occurs in multi fasta files

I searched this topic and could not find. I have 5593 multi fasta files and I need to count how many time each species id occurs in each file. I can only identify the the total number of sequences in each species, but i can't identify the input files.

Input

file1.fasta:

>hsa
ATCGATCGATCAGACTACG

>eco
ATCGATCGATCAGACTACG

file2.fasta:

>hsa
GATCGATCAGACTACGAAA

>hsa
GATCGATCACAGACTACGAAA

file3.fasta:

>hsa
CTAGACTAGATAGACACATAGAGA

>ecj
CTAGACTAGCTAGACCCATAGAGA

>mmu
CTAGACAAGATAGACACAAAGAGA

>eco
CTAGACTACATCGACACATAGAGA

Expected output

file1.fasta >hsa [count]
file1.fasta >eco [count]
file2.fasta >hsa [count]
file3.fasta >hsa [count]

file3.fasta >ecj [count]
file3.fasta >mmu [count]
file3.fasta >eco [count]

<!-

awk /^>.../ {print $1} *.* | sort | uniq -c | sort -nr

Ouput

[total counts]>hsa

[total counts]>eco

[total counts]>mmu

[total counts]>ecj

Upvotes: 0

Views: 405

Answers (2)

jaypal singh
jaypal singh

Reputation: 77115

Here is one way using awk:

awk '
BEGIN { SUBSEP = FS }
!(FILENAME in name) { fileorder[++num]=FILENAME; name[FILENAME]++ }
/^>/ { count[FILENAME,$1]++ }
END { 
  for (order=1; order<=num; order++) {
      for (total in count) {
          split(total, keys, SUBSEP)
          if (fileorder[order]==keys[1]) {
              print fileorder[order], keys[2], count[fileorder[order],keys[2]]
          }
      }
  }
}' file1 file2 file3

Output:

file1 >eco 1
file1 >hsa 1
file2 >hsa 2
file3 >eco 1
file3 >mmu 1
file3 >hsa 1
file3 >ecj 1
  • In the BEGIN block we set SUBSEP to FS. This allows the composite keys to be separated by a space.
  • We maintain two arrays, name and fileorder. If our file is not present in name array we populate the fileorder array with the filename.
  • For the lines starting with fasta pattern we populate another array called count which has the filename and pattern as composite key and counts as value.
  • In the END block, we iterate through our fileorder array and then our count array. We split the composite key using split function and check if filename is matching first part of our key. If it is we print the filename, pattern and count and move on to the next iteration.

Upvotes: 1

Jonathan Leffler
Jonathan Leffler

Reputation: 754400

Assuming that the 'species' lines start with >, and that the square bracketed expressions in the sample output are simple numbers, then:

awk 'BEGIN { SUBSEP = " " }
     /^>/ { per_file[FILENAME,$1]++; total[$1]++ }
     END { for (k in per_file) print k, per_file[k]
           for (k in total)    print total[k], k
         }' *.fasta

You'll probably need to do some sorting somewhere along the line, either in awk or afterwards, as there's no guarantee that the data presented by a for (index in array) loop will be in any particular order.

Without the BEGIN block (or other mechanism) setting SUBSEP, there would be a \034 character after the filename and before the species key. By setting SUBSEP to a blank, the filename is separated from the species key by a blank.

Upvotes: 1

Related Questions