Reputation: 343

Word count for sequence length is wrong

I have a FASTA file that looks like this:

>0011 my.header
CAAGTTTATCCACATAATGCGAATAACCAATAATCCTTTTCATAAGTCTATTCTTCATAATCTAAATCGT
TTTCAAGTACATAATTATCCTTTGCCTGTTCGTTAGTTTTATTAAAATTATACTGATCTTTCTTTTTCAT
CCCACGGGTTAAAATCTTCCTCAATCGGTGGGTTTTCTTCATGAAATTGTTTCATTTATTTGCTGTTTTT
AGTTCTCCGATTGTATAACACTTAGTTGTATTAGTGCCGGGTAGTCTATAATTAGCCTCTTTTATATACC
CACGCTTTAATAATCTGTTTACAGAATTATATAATTTGCTCTTAGACATAAAAGGAATAATTTCTCTAAG
TTTAGAAATCGTAATAAAAACGGTATTAGGTTCTTTCTTTACCCTACATCCCTTAAACTTATCCTTATAT
GTATCAGTACAAAGTATAAGAAACATAACTGAATATACTACTGAATCATCTAAACCGATTTCTTTTGCTA
AATCTTCATTTATAACCATAATTATAACGCTTTTAATTGAATTGACTCTTTAACATTTGATGTTTTAACG
AACTGATCGTATATTTCCGGATATTGTTCTTTCAGTGCTTTAGAATCAAGTGATTCACGGCTATACGCTT
TCTTCCTTGTGACTGAAATAAGTTCCCCTTTTATATTATCAGCTTTCGCCTCAGACATCAGACCTAACAA
CTGTTCTTTGAACTTGCCTAAATGTTCGTCTATCTTCTTTTGCATTTCAAGAAGTTCGTAAACGCCTTCT
TCGATATGTGCAACCTTTGCAGGCAACGACTCCAATTTAGCTACATAACTGTCTTTGCTTGCATTGTCTG
CATATCGAACTCCATTCTTACAGCAATTAAGGAATAATTCTATTTCGCTGTCCGGTATGCGTTCAACAGA
GAAAATTCCGTCCTTATCCTTGTCACCTCTTAGCCAAATTGCGATAAGTCCCTCTACTTTCAAATTTGGG
TTTTGTCTCTCGAAAAGATAGGCGTATATTGATAGCTGCCAAGACAAATAAAGCAAATCAAGTTTGTAGG
TAGTTTTAATGTCACCTAAAACGACTGATTTATCAGAGCTGCCCAAATATACTTTATCGGTCGGTGATGC
GATAAGCTCGTTATCAGTTAGAATATACTCAGATGCGATATGAATTAAACCGCTTCCGGCTTTTAAATTC
AAATAGTTCTCTCCGTAGACCGTTTCCGGTTCAATACCTTCTTTGTCGATCCTCTCAACTTCATCATGAA
CCGCTTTCCCTCTCTCAGTTGCCGATCTCAAAATATTATCCGGTATATTGTCAAGTTTGCCTGGAAATAA

and I want the length of the sequence (without the header). I tried this:

tail -n +2 my.file | wc -c

The count it prints is wrong: the real size is 1330.

I'm not sure what's going on. I suspect there are hidden characters in the file, but I don't know how to check for them.

Upvotes: 7

Answers (5)

anubhava

Reputation: 785631

That is because wc -c counts the line breaks as well as the sequence characters.

You may use awk to get this done:

awk 'NR>1{s+=length()} END{print s}' my.file

You may also use tail | tr | wc:

tail -n +2 my.file | tr -d '\n' | wc -c
1330
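
To see where the extra bytes come from, you can compare the byte count with and without newlines. This is a minimal sketch on a small made-up file (the `sample` temp file and its contents are hypothetical, not the asker's data):

```shell
# Hypothetical sample file, created only for this demonstration.
sample=$(mktemp)
printf '>0011 my.header\nACGT\nACGT\n' > "$sample"

# Bytes after the header line, newlines included.
with_newlines=$(tail -n +2 "$sample" | wc -c)
# Number of newline characters after the header line.
newlines=$(tail -n +2 "$sample" | wc -l)
# Bytes left once the newlines are deleted.
without_newlines=$(tail -n +2 "$sample" | tr -d '\n' | wc -c)

# $(( )) strips the padding some wc implementations add.
echo "with newlines: $((with_newlines)), newlines: $((newlines)), without: $((without_newlines))"
# with newlines: 10, newlines: 2, without: 8

rm -f "$sample"
```

By the same reasoning, the 19 sequence lines shown in the question would account for 19 extra bytes, assuming each line ends in a single newline.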

Upvotes: 19

user unknown

Reputation: 36250

Subtract the line count from the character count after removing the header:

tail -n +2 my.file | wc -lc | awk '{print $2-$1}'
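
Note that the subtraction assumes every line ends in exactly one newline byte. If the file has Windows-style CRLF line endings (an assumption, nothing in the question confirms it), each line carries an extra `\r` and the result would still be too large. A hedged sketch that deletes both characters instead; `seq_length` is a hypothetical helper name:

```shell
# seq_length FILE
# Hypothetical helper: prints the number of bytes after the first line,
# ignoring carriage returns and newlines.
seq_length() {
    # tail drops the header line; tr deletes \r and \n;
    # the last tr strips the padding some wc implementations print.
    tail -n +2 "$1" | tr -d '\r\n' | wc -c | tr -d ' '
}
```

Usage: `seq_length my.file`. To check for hidden characters directly, `tail -n +2 my.file | od -c | head` prints every byte, so a carriage return shows up as `\r`.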

Upvotes: 1

kyodev

Reputation: 583

Bash only, as a script (we are supposed to talk about programming here ;o)

tk="$(<my.file)"      # read the whole file into a variable
tk="${tk#>*$'\n'}"    # strip the header: from '>' up to the first newline
tk="${tk//$'\n'}"     # strip all remaining newlines

echo ": ${#tk}"       # 1330  \o/

Upvotes: 2

RavinderSingh13

Reputation: 133650

EDIT: Adding a few more awk solutions here too.

awk -v RS="" -v FS="\n" '{$1="";sub(/^ +/,"");gsub(/ /,"");print length($0)}'  Input_file

awk -v RS="" -v FS="\n" '{$1="";sub(/^ +/,"");print length($0)}' OFS=""  Input_file

awk -v RS= '{gsub(/^[^\n]*|\n/, ""); print length()}'  Input_file

The following awk may also help:

awk '!/^>/{sum+=length($0)} END{print "Length is:" sum}'  Input_file
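
Because that command skips every line starting with `>`, it would sum all sequences together if the file held more than one record. The question shows only a single record, so the following is just a sketch for the multi-record case; `fasta_lengths` is a hypothetical helper name:

```shell
# fasta_lengths FILE
# Hypothetical helper: prints "header<TAB>length" for each record.
fasta_lengths() {
    awk '
        # A header line: flush the previous record, then start a new one.
        /^>/ { if (seen) print name "\t" len
               name = substr($0, 2); len = 0; seen = 1; next }
        # A sequence line: add its length (awk excludes the newline).
        seen { len += length($0) }
        # Flush the last record.
        END  { if (seen) print name "\t" len }
    ' "$1"
}
```

Usage: `fasta_lengths Input_file`. For the file in the question this should print a single line with the header text and the total length.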

Upvotes: 3

glenn jackman

Reputation: 247012

perl:

perl -0777 -nE 's/^>.*$//m; say tr/A-Z/A-Z/' file

That reads the file into a single string, removes the first line, and counts the letters.

Upvotes: 2
