Reputation: 137

Count specific character per every species of a Fasta file

I have been trying to find the amount of 1s per each species in a fasta file that looks like this:

The desired output would be:

I know how to get the numbers of 1s in a file with:

grep -c 1 file

My problem is that I cannot find the way to keep track of the number of 1s per each species (instead of the total in the file).

Upvotes: 2

Answers (9)

ufopilot

Reputation: 3985

$ awk '!/>/{split($0,a,"");$0=0;for(i in a)$0+=a[i]}1' fasta.txt
>111
5
>102
4

$ awk '!/>/{for(i=sum=0;i++<length;)sum+=substr($0,i,1);$0=sum}1' fasta.txt
>111
5
>102
4

Upvotes: 1

RARE Kpop Manifesto

Reputation: 2845

mawk -F1 '$ !NF=NF - NF^/^[>]/'
gawk -F1 '$_=NF-NF ^/^>/'      # that's the most succinct
                               # vers. I could conjure up 

>111
1100101010
>102
1110000001

5
4

If you really want the other 2 rows ::

 gawk -F1 '/^>/ || $_=--NF'
 mawk -F1 '/^>/ || $!_=$_=--NF' 

>111
1100101010
>102
1110000001
 
     1  >111
     2  5
     3  >102
     4  4

And if you need the multiline ones :

gawk '$_ = sprintf("%.*s%.*s\n%.f",/^[^>]/,">", (__= index($_,ORS) ) - FS, $_,
      NF-FS-substr("",__=substr($_,FS,__))-gsub(FS,"",__))' RS='\n[>]' FS=1

>111                                                                 
9

Upvotes: 1

Paul Hodges

Reputation: 15358

To handle multiple lines of data on one header under a single instance of awk -

$: cat fasta.txt
>101
1100101010
1111111010
1100000000
>102
1110000001
>103
1100000000
1110000001


$: awk '/^>/{if(NR>1){print cnt;} print; cnt=0;} /^[01]/{ cnt+=gsub(/1/,""); } END{print cnt;}' fasta.txt
>101
15
>102
4
>103
6

Not as elegant as other versions here, but maybe easier to read and understand. YMMV.

Upvotes: 2

anubhava

Reputation: 785376

Here is gnu-awk solution that would work with multiline record as well:

cat file
>111
11001
01010
>102
1110000001

awk -v RS='>[0-9]+\n' 'NF {printf ORS "%s\n", gsub(/1/, "&")} {ORS=RT}' file
>111
5
>102
4

Upvotes: 3

j_b

Reputation: 2020

Assuming your fasta is formatted as you indicate, and assuming using awk would be acceptable, then the following might work:

while read -r one ; do 
    echo "${one}"
    read -r two
    awk -F"1" '{print NF-1}' <<< "${two}"
done <fasta.txt

(Note: The awk command is splitting the string by '1' and then printing the number of resulting fields minus 1)

fasta.txt:

Output:

Per @ikegami, if records are spread over multiple lines:

#!/bin/bash

fasta_file="${1:-fasta2.txt}"

while read -r line ; do 
    if [[ "$line" =~ ^\>.* ]]; then
        [[ -n "${cnt}" ]] && echo "${cnt}"
        cnt=0
        echo "${line}"
    else
       ((cnt += $(awk -F"1" '{print NF-1}' <<< "${line}") ))
    fi
done <"${fasta_file}"

[[ -n "${cnt}" ]] && echo "${cnt}"

Upvotes: 4

Ted Lyngmo

Reputation: 117408

grep -c 1 will give you the number of matching lines, not the total number of 1s. You could use grep -o to make it print only the matching parts of each matching line on a separate line each and then wc -l to count the number of lines.

while read -r line
do
    if [[ ${line:0:1} == '>' ]]; then
        if [[ -n $count ]]; then
            printf "%d\n" $count
        fi
        count=0
        echo "$line"
    else
        ((count += $(grep -o 1 <<< "$line" | wc -l)))
    fi
done < fasta_file

if [[ -n $count ]]; then
    printf "%d\n" $count
fi

Or in pure bash using parameter expansion:

while read -r line
do
    if [[ ${line:0:1} == '>' ]]; then
        if [[ -n $count ]]; then
            printf "%d\n" $count
        fi
        count=0
        echo "$line"
    else
        line="${line//[^1]/}"   # remove everything but 1's
        ((count += ${#line}))   # add the length of line to count
    fi
done < fasta_file

if [[ -n $count ]]; then
    printf "%d\n" $count
fi

A similar setup in perl:

open my $fh, 'fasta_file' or die "$!";

my $count=-1;

while(<$fh>) {
    if(/^>/) {
        print "$count\n" unless($count == -1);
        $count = 0;
        print;
    } else {
        $count += tr/1//;
    }
}
print "$count\n" unless($count == -1);

close $fh;

The transliteration operator, tr///, will return how many transliterations it performed and since 1 is the only argument, it'll be the same as counting 1's.

Upvotes: 7

ikegami

Reputation: 386216

>111
11001010101110000001

can also be written as

>111
1100101010
1110000001

but none of the existing solutions work for the latter. This addresses that oversight:

perl -Mv5.10 -ne'
   if ( /^>/ ) {
      say $c if defined $c;
      $c = 0;
      print;
   } else {
      $c += tr/1//;
   }
   END {
      say $c if defined $c;
   }
' file.fasta

For both files show above, the program outputs

>111
9

Upvotes: 6

RavinderSingh13

Reputation: 133590

With your shown samples, please try following awk code. Written and tested in GNU awk, should work in any awk.

awk '/^>/{print;next} {print (NF?split($0,arr,"1")-1:0)}' Input_file

Explanation: Simple explanation would be, checking condition if line starts from > then print that line and next will skip all further statements from here. Then using print function where checking if NF is NOT NULL then using split function to split current line into array arr with delimiter of 1(which will provide number of 1s present in current line, doing -1 will give accurate count), ELSE NF is NOT NULL then print 0(for EMPTY lines).

Upvotes: 1

markp-fuso

Reputation: 34808

One awk idea:

awk '
/^>/ { print ; next }              # print lines starting with ">"; skip to next input line
     { print gsub(/1/,"x") }       # replace all "1" characters with dummy "x"; gsub() returns number of replacements (ie, number of "1" characters in the line)
' file

Or as a one-liner:

awk '/^>/ {print;next} {print gsub(/1/,"x")}' file

Collapsing into a ternary operator to determine what to print:

awk '{print ($0 ~ /^>/ ? $0 : gsub(/1/,"x"))}' file

These all generate:

Upvotes: 5

Count specific character per every species of a Fasta file

Answers (9)

Related Questions