Reputation: 23
I have a fasta file which contains protein sequences. How can I count the number of residues in each sequence with awk?
>seq1
PESDFA
>seq2
>seq3
GFCSSWWR
Desired Output
seq1 6
seq2 0
seq3 8
Upvotes: 0
Views: 785
Reputation: 204456
$ awk -F'>' '
NF==2 { seq=$2; lgth[seq]=0; next }
{ lgth[seq]=length($0) }
END { for (seq in lgth) print seq, lgth[seq] }
' file
seq1 6
seq2 0
seq3 8
The order in which for (seq in lgth) visits the keys is unspecified, so if you care about the order of the output, keep a separate array of seq values:
$ awk -F'>' '
NF==2 { seq=$2; seqs[++numSeqs]=seq; next}
{ lgth[seq]=length($0) }
END { for (i=1; i<=numSeqs; i++) print seqs[i], lgth[seqs[i]]+0 }
' file
seq1 6
seq2 0
seq3 8
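One caveat: FASTA files often wrap a sequence over several lines, and lgth[seq]=length($0) then keeps only the length of the last line. A minimal sketch that adds up the lines instead, assuming headers start with > and that spaces, tabs and carriage returns should not count as residues (count_residues, seqs, len and n are names made up for this sketch):

```shell
# count_residues is a hypothetical helper name used only for this sketch.
count_residues() {
  awk '
    # Header line: strip the leading ">" and remember the name in input order.
    /^>/ { seq = substr($0, 2); seqs[++n] = seq; next }
    # Sequence line (only once a header has been seen): drop blanks and CRs,
    # then add this line to the running total for the current sequence.
    n    { gsub(/[ \t\r]/, ""); len[seq] += length($0) }
    # "+0" turns the unset entry of an empty sequence into 0.
    END  { for (i = 1; i <= n; i++) print seqs[i], len[seqs[i]] + 0 }
  ' "$@"
}
```

Call it as count_residues file, or pipe the FASTA text into it. Like the code above, it prints everything after the > as the name, so a header with a description keeps that description.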
Upvotes: 0
Reputation: 195229
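This one-liner is not pretty, but it works for your example (it needs a shell with process substitution such as bash, and the names in the output keep their leading >):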
kent$ paste f <(sed '1d' f)|awk '/^>/{print $1, ($2~/^>/?0:length($2))}'
>seq1 6
>seq2 0
>seq3 8
Upvotes: 1
Reputation: 41460
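This awk gives you only the sequences that have residues (it relies on an empty FS splitting each line into one field per character, which gawk does but POSIX leaves unspecified):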
awk -v FS="" '!/^>/ {print f,NF} {f=substr($0,2)}' file
seq1 6
seq3 8
To get seq2 as well, you can do this:
awk '{printf (/^>/&&NR>1?RS:"")"%s ",$0} END {print ""}' file | awk '{print substr($1,2),length($2)}'
seq1 6
seq2 0
seq3 8
The first part joins each seq header and its sequence data onto one line; the second part prints the name and the length.
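To make the two stages easier to follow, here is a sketch of the same idea with the first stage pulled out into a function. The name join_records is made up for this illustration, and the printf is written with an explicit format string, so it is a restatement of the approach, not the exact command above:

```shell
# join_records is a hypothetical helper name used only for this sketch.
# Stage 1: start a new output line at every header except the first one,
# and append each input line followed by a single space.
join_records() {
  awk '{ printf "%s%s ", (/^>/ && NR > 1 ? "\n" : ""), $0 }
       END { print "" }' "$@"
}

# Stage 2: field 1 is the header (drop its leading ">"),
# field 2 is the sequence, or empty when the record has no residues.
count_joined() {
  join_records "$@" | awk '{ print substr($1, 2), length($2) }'
}
```

For the sample input the first stage emits three lines, ">seq1 PESDFA", ">seq2" and ">seq3 GFCSSWWR", each with a trailing space. Because the second stage only looks at $2, this assumes the header has no spaces and the sequence is not wrapped over several lines.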
Upvotes: 0