Reputation: 343
I have a FASTA file that looks like this:
>0011 my.header
CAAGTTTATCCACATAATGCGAATAACCAATAATCCTTTTCATAAGTCTATTCTTCATAATCTAAATCGT
TTTCAAGTACATAATTATCCTTTGCCTGTTCGTTAGTTTTATTAAAATTATACTGATCTTTCTTTTTCAT
CCCACGGGTTAAAATCTTCCTCAATCGGTGGGTTTTCTTCATGAAATTGTTTCATTTATTTGCTGTTTTT
AGTTCTCCGATTGTATAACACTTAGTTGTATTAGTGCCGGGTAGTCTATAATTAGCCTCTTTTATATACC
CACGCTTTAATAATCTGTTTACAGAATTATATAATTTGCTCTTAGACATAAAAGGAATAATTTCTCTAAG
TTTAGAAATCGTAATAAAAACGGTATTAGGTTCTTTCTTTACCCTACATCCCTTAAACTTATCCTTATAT
GTATCAGTACAAAGTATAAGAAACATAACTGAATATACTACTGAATCATCTAAACCGATTTCTTTTGCTA
AATCTTCATTTATAACCATAATTATAACGCTTTTAATTGAATTGACTCTTTAACATTTGATGTTTTAACG
AACTGATCGTATATTTCCGGATATTGTTCTTTCAGTGCTTTAGAATCAAGTGATTCACGGCTATACGCTT
TCTTCCTTGTGACTGAAATAAGTTCCCCTTTTATATTATCAGCTTTCGCCTCAGACATCAGACCTAACAA
CTGTTCTTTGAACTTGCCTAAATGTTCGTCTATCTTCTTTTGCATTTCAAGAAGTTCGTAAACGCCTTCT
TCGATATGTGCAACCTTTGCAGGCAACGACTCCAATTTAGCTACATAACTGTCTTTGCTTGCATTGTCTG
CATATCGAACTCCATTCTTACAGCAATTAAGGAATAATTCTATTTCGCTGTCCGGTATGCGTTCAACAGA
GAAAATTCCGTCCTTATCCTTGTCACCTCTTAGCCAAATTGCGATAAGTCCCTCTACTTTCAAATTTGGG
TTTTGTCTCTCGAAAAGATAGGCGTATATTGATAGCTGCCAAGACAAATAAAGCAAATCAAGTTTGTAGG
TAGTTTTAATGTCACCTAAAACGACTGATTTATCAGAGCTGCCCAAATATACTTTATCGGTCGGTGATGC
GATAAGCTCGTTATCAGTTAGAATATACTCAGATGCGATATGAATTAAACCGCTTCCGGCTTTTAAATTC
AAATAGTTCTCTCCGTAGACCGTTTCCGGTTCAATACCTTCTTTGTCGATCCTCTCAACTTCATCATGAA
CCGCTTTCCCTCTCTCAGTTGCCGATCTCAAAATATTATCCGGTATATTGTCAAGTTTGCCTGGAAATAA
I want the length of the sequence (without the header). I tried this:
tail -n +2 my.file | wc -c
which gives me this output:
1349
which is wrong; the real length is 1330.
I'm not sure what's going on. I suspect there are hidden characters, but I don't know how to check for them.
Upvotes: 7
Views: 2041
Reputation: 785631
That is because wc -c counts the line breaks as well: each of the 19 sequence lines ends in a newline, which accounts for the difference (1349 - 19 = 1330).
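One way to look for hidden characters is to dump the bytes. The following is a minimal sketch on a small made-up file, not the asker's data (the variable name seq_lines is hypothetical): od -c shows each line break as \n, and counting those separately recovers the number of bases.

```shell
# Small stand-in for the sequence lines (hypothetical data, not the asker's file).
seq_lines=$(mktemp)
printf 'ACGT\nACG\n' > "$seq_lines"

# od -c prints every byte; line breaks show up as \n.
od -c "$seq_lines"

total=$(wc -c < "$seq_lines")                    # 9 bytes in all
newlines=$(tr -cd '\n' < "$seq_lines" | wc -c)   # 2 of them are line breaks
echo "$((total - newlines))"                     # 7 bases
rm -f "$seq_lines"
```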
You may use awk to get this done:
awk 'NR>1{s+=length()} END{print s}' my.file
1330
You may also use tail | tr | wc:
tail -n +2 my.file | tr -d '\n' | wc -c
1330
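The numbers in the question are fully explained by newlines alone, but if a file had Windows line endings (an assumption, not something the question shows), each line would also carry a carriage return. A hedged sketch on a made-up file, deleting both characters:

```shell
# Hypothetical file with CRLF line endings (not the asker's data).
tmp=$(mktemp)
printf '>id header\r\nACGT\r\nACG\r\n' > "$tmp"

# Deleting only \n leaves the carriage returns in the count.
lf_only=$(tail -n +2 "$tmp" | tr -d '\n' | wc -c)       # 7 bases + 2 carriage returns
# Deleting both \r and \n counts the bases alone.
crlf_safe=$(tail -n +2 "$tmp" | tr -d '\r\n' | wc -c)   # 7 bases
echo "$((lf_only)) $((crlf_safe))"
rm -f "$tmp"
```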
Upvotes: 19
Reputation: 36250
Subtract the line count from the character count after removing the header:
tail -n +2 fasta.file | wc -lc | awk '{print $2-$1}'
Upvotes: 1
Reputation: 583
Pure bash, in a script (we are here to talk about programming, after all ;o)
tk="$(<my.file)" # read the whole file into a variable
tk="${tk#>*$'\n'}" # strip the header: shortest prefix from '>' to the first newline
tk="${tk//$'\n'}" # delete all remaining newlines
echo ": ${#tk}" # prints ": 1330" \o/
Upvotes: 2
Reputation: 133650
EDIT: Adding a few more awk solutions here too.
awk -v RS="" -v FS="\n" '{$1="";sub(/^ +/,"");gsub(/ /,"");print length($0)}' Input_file
OR
awk -v RS="" -v FS="\n" '{$1="";sub(/^ +/,"");print length($0)}' OFS="" Input_file
OR
awk -v RS= '{gsub(/^[^\n]*|\n/, ""); print length()}' Input_file
The following awk may also help:
awk '!/^>/{sum+=length($0)} END{print "Length is:" sum}' Input_file
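Filtering on the '>' pattern skips every header line, whereas a line-number test such as NR>1 skips only the first line. On a single-record file the two agree; the sketch below, on a made-up two-record file, illustrates where they would differ.

```shell
# Hypothetical two-record FASTA file (not the asker's data).
tmp=$(mktemp)
printf '>seq1\nACGT\nAC\n>seq2\nGGG\n' > "$tmp"

# Skip every header line: only bases are counted (4 + 2 + 3).
by_pattern=$(awk '!/^>/{sum+=length($0)} END{print sum}' "$tmp")
# Skip only line 1: the second header ">seq2" (5 characters) is counted too.
by_line=$(awk 'NR>1{s+=length()} END{print s}' "$tmp")
echo "$by_pattern $by_line"
rm -f "$tmp"
```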
Upvotes: 3
Reputation: 247012
perl:
perl -0777 -nE 's/^>.*$//m; say tr/A-Z/A-Z/' file
That reads the file into a single string, removes the header line, and counts the uppercase letters.
Upvotes: 2