Reputation: 5

How to count the total number of residues in a sequence with awk?

I have a text file that contains protein sequences. I would like to get the total number of residues in each sequence. How can I do this with awk?

>1GS9
PYCPAAVIAPVV
>1LE2
DFEFAKWKN
>1NFN
ADAPPDS

Desired output:

1GS9 - 12
1LE2 - 9
1NFN - 7

Upvotes: 0

Answers (5)

potong

Reputation: 58528

This might work for you (GNU awk):

awk -vRS='>' -vOFS=' - ' 'NR>1{print $1,length($2)}' file
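
A possible extension, offered as a sketch rather than a tested recipe: with RS set to '>', each header and its sequence form one record, and the default field splitting puts the ID in $1 and the sequence in $2. If a sequence were wrapped across several lines (an assumption, since the question shows one line per sequence), the extra lines would land in $3 and beyond, so summing the lengths of fields 2 through NF should cover that case. The function name count_wrapped and the file wrapped.txt are made up for illustration, and the sketch assumes the ID contains no whitespace.

```shell
# Hypothetical helper: count residues per record, tolerating wrapped sequences.
count_wrapped() {
    awk -v RS='>' -v OFS=' - ' '
        NR > 1 {                       # record 1 is the empty text before the first ">"
            n = 0
            for (i = 2; i <= NF; i++)  # $1 is the ID; $2..$NF are sequence lines
                n += length($i)
            print $1, n
        }' "$1"
}

# Sample data from the question, with the first sequence wrapped (assumed layout).
printf '>1GS9\nPYCPAA\nVIAPVV\n>1LE2\nDFEFAKWKN\n>1NFN\nADAPPDS\n' > wrapped.txt
count_wrapped wrapped.txt
```

Writing -v RS='>' with a space after -v keeps the command usable in awk implementations other than GNU awk.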

Upvotes: 0

Vijay

Reputation: 67291

awk '{line=substr($0,2);getline;print line,"-",length($0)}' temp

Tested below:

$ cat temp
>1GS9
PYCPAAVIAPVV
>1LE2
DFEFAKWKN
>1NFN
ADAPPDS
$ awk '{line=substr($0,2);getline;print line,"-",length($0)}' temp
1GS9 - 12
1LE2 - 9
1NFN - 7

Upvotes: 0

Birei

Reputation: 36282

The main { ... } block runs on every odd line (the header), and getline then reads the following even line (the protein sequence) into prot:

awk ' {
    getline prot;
    printf "%s - %d\n", substr( $0, 2 ), length( prot ) 
}' infile

Output:

1GS9 - 12
1LE2 - 9
1NFN - 7
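
One caveat with the getline approach, shown as a minimal sketch: if the file ends with a header that has no sequence line after it, getline fails and prot still holds the previous sequence, so the last line of output would be wrong. Checking the return value of getline (1 on success, 0 at end of file) avoids that. The function name count_pairs, the file truncated.txt, and the "no sequence line" message are made up for illustration.

```shell
# Hypothetical helper: header/sequence pairs, guarding against a missing sequence.
count_pairs() {
    awk '{
        id = substr($0, 2)          # drop the leading ">"
        if ((getline prot) > 0)     # 1 on success, 0 at end of file
            printf "%s - %d\n", id, length(prot)
        else
            printf "%s - no sequence line\n", id
    }' "$1"
}

# Assumed test input: the second header has no sequence after it.
printf '>1GS9\nPYCPAAVIAPVV\n>1LE2\n' > truncated.txt
count_pairs truncated.txt
```

On well-formed input with strictly alternating header and sequence lines, this should print the same output as the original command.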

Upvotes: 0

tzelleke

Reputation: 15345

awk '/^>/ {
   name=substr($0,2);
   getline;
   printf("%s - %d\n", name, length($1))
}' input_file

Upvotes: 1

beny23

Reputation: 35048

You could do this:

 awk '/^>/ { res=substr($0, 2); } /^[^>]/ { print res " - " length($0); }' < file

Upvotes: 0
