Reputation: 3
I searched this topic and could not find. I have 5593 multi fasta files and I need to count how many time each species id occurs in each file. I can only identify the the total number of sequences in each species, but i can't identify the input files.
Input
file1.fasta:
>hsa
ATCGATCGATCAGACTACG
>eco
ATCGATCGATCAGACTACG
file2.fasta:
>hsa
GATCGATCAGACTACGAAA
>hsa
GATCGATCACAGACTACGAAA
file3.fasta:
>hsa
CTAGACTAGATAGACACATAGAGA
>ecj
CTAGACTAGCTAGACCCATAGAGA
>mmu
CTAGACAAGATAGACACAAAGAGA
>eco
CTAGACTACATCGACACATAGAGA
Expected output
file1.fasta >hsa [count]
file1.fasta >eco [count]
file2.fasta >hsa [count]
file3.fasta >hsa [count]
file3.fasta >ecj [count]
file3.fasta >mmu [count]
file3.fasta >eco [count]
<!-
awk /^>.../ {print $1} *.* | sort | uniq -c | sort -nr
Ouput
[total counts]>hsa
[total counts]>eco
[total counts]>mmu
[total counts]>ecj
Upvotes: 0
Views: 405
Reputation: 77115
Here is one way using awk
:
awk '
BEGIN { SUBSEP = FS }
!(FILENAME in name) { fileorder[++num]=FILENAME; name[FILENAME]++ }
/^>/ { count[FILENAME,$1]++ }
END {
for (order=1; order<=num; order++) {
for (total in count) {
split(total, keys, SUBSEP)
if (fileorder[order]==keys[1]) {
print fileorder[order], keys[2], count[fileorder[order],keys[2]]
}
}
}
}' file1 file2 file3
Output:
file1 >eco 1
file1 >hsa 1
file2 >hsa 2
file3 >eco 1
file3 >mmu 1
file3 >hsa 1
file3 >ecj 1
BEGIN
block we set SUBSEP
to FS
. This allows the composite keys to be separated by a space.name
and fileorder
. If our file is not present in name
array we populate the fileorder
array with the filename. count
which has the filename and pattern as composite key and counts as value.END
block, we iterate through our fileorder
array and then our count
array. We split the composite key using split
function and check if filename is matching first part of our key. If it is we print the filename, pattern and count and move on to the next iteration.Upvotes: 1
Reputation: 754400
Assuming that the 'species' lines start with >
, and that the square bracketed expressions in the sample output are simple numbers, then:
awk 'BEGIN { SUBSEP = " " }
/^>/ { per_file[FILENAME,$1]++; total[$1]++ }
END { for (k in per_file) print k, per_file[k]
for (k in total) print total[k], k
}' *.fasta
You'll probably need to do some sorting somewhere along the line, either in awk
or afterwards, as there's no guarantee that the data presented by a for (index in array)
loop will be in any particular order.
Without the BEGIN
block (or other mechanism) setting SUBSEP
, there would be a \034
character after the filename and before the species key. By setting SUBSEP
to a blank, the filename is separated from the species key by a blank.
Upvotes: 1