Reputation: 137
I have been trying to count the number of 1s for each species in a fasta file that looks like this:
>111
1100101010
>102
1110000001
The desired output would be:
>111
5
>102
4
I know how to get the number of 1s in a file with:
grep -c 1 file
My problem is that I cannot find a way to keep track of the number of 1s for each species (instead of the total for the file).
Upvotes: 2
Views: 693
Reputation: 3985
$ awk '!/>/{split($0,a,"");$0=0;for(i in a)$0+=a[i]}1' fasta.txt
>111
5
>102
4
$ awk '!/>/{for(i=sum=0;i++<length;)sum+=substr($0,i,1);$0=sum}1' fasta.txt
>111
5
>102
4
Upvotes: 1
Reputation: 2845
mawk -F1 '$!NF=NF - NF^/^[>]/'
gawk -F1 '$_=NF - NF^/^>/'   # that's the most succinct vers. I could conjure up
>111
1100101010
>102
1110000001
5
4
If you really want the other 2 rows ::
gawk -F1 '/^>/ || $_=--NF'
mawk -F1 '/^>/ || $!_=$_=--NF'
>111
1100101010
>102
1110000001
>111
5
>102
4
And if you need the multiline ones :
gawk '$_ = sprintf("%.*s%.*s\n%.f",/^[^>]/,">", (__= index($_,ORS) ) - FS, $_,
NF-FS-substr("",__=substr($_,FS,__))-gsub(FS,"",__))' RS='\n[>]' FS=1
>111
9
Upvotes: 1
Reputation: 15358
To handle multiple lines of data on one header under a single instance of awk:
$: cat fasta.txt
>101
1100101010
1111111010
1100000000
>102
1110000001
>103
1100000000
1110000001
$: awk '/^>/{if(NR>1){print cnt;} print; cnt=0;} /^[01]/{ cnt+=gsub(/1/,""); } END{print cnt;}' fasta.txt
>101
15
>102
4
>103
6
Not as elegant as other versions here, but maybe easier to read and understand. YMMV.
Upvotes: 2
Reputation: 785376
Here is a gnu-awk solution that works with multiline records as well:
cat file
>111
11001
01010
>102
1110000001
awk -v RS='>[0-9]+\n' 'NF {printf ORS "%s\n", gsub(/1/, "&")} {ORS=RT}' file
>111
5
>102
4
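The counting trick here is that gsub() returns the number of substitutions it made; replacing every 1 with itself (&) counts the occurrences without altering the record. A minimal check, portable to any awk:

```shell
# gsub() returns the number of replacements it performed; "&" substitutes
# each match with itself, so the line is counted but left unchanged.
echo "1100101010" | awk '{ print gsub(/1/, "&") }'    # prints 5
```

Note that RT and a regex-valued RS are GNU awk extensions, so the multiline one-liner above does need gawk; the gsub() counting itself works in any awk.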
Upvotes: 3
Reputation: 2020
Assuming your fasta is formatted as you indicate, and that using awk would be acceptable, the following might work:
while read -r one ; do
    echo "${one}"
    read -r two
    awk -F"1" '{print NF-1}' <<< "${two}"
done <fasta.txt
(Note: The awk command is splitting the string by '1' and then printing the number of resulting fields minus 1)
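The field-splitting idea can be checked in isolation; splitting on 1 always yields one more field than there are separators:

```shell
# Splitting "1100101010" on "1" yields 6 fields ("", "", "00", "0", "0", "0"),
# so NF-1 = 5, the number of 1s in the string.
awk -F"1" '{ print NF - 1 }' <<< "1100101010"    # prints 5
```

One caveat: an empty line gives NF=0 and would print -1, though the format shown above never produces one.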
fasta.txt:
>111
1100101010
>102
1110000001
Output:
>111
5
>102
4
Per @ikegami, if records are spread over multiple lines:
#!/bin/bash
fasta_file="${1:-fasta2.txt}"
while read -r line ; do
    if [[ "$line" =~ ^\>.* ]]; then
        [[ -n "${cnt}" ]] && echo "${cnt}"
        cnt=0
        echo "${line}"
    else
        ((cnt += $(awk -F"1" '{print NF-1}' <<< "${line}") ))
    fi
done <"${fasta_file}"
[[ -n "${cnt}" ]] && echo "${cnt}"
Upvotes: 4
Reputation: 117408
grep -c 1 will give you the number of matching lines, not the total number of 1s. You could use grep -o to print only the matching parts, each on a separate line, and then wc -l to count the number of lines.
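A quick check of the grep -o | wc -l idea on one line of the sample data:

```shell
# grep -o prints each match on its own line; wc -l then counts those lines,
# giving the number of occurrences rather than the number of matching lines.
echo "1100101010" | grep -o 1 | wc -l    # prints 5
```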
while read -r line
do
    if [[ ${line:0:1} == '>' ]]; then
        if [[ -n $count ]]; then
            printf "%d\n" $count
        fi
        count=0
        echo "$line"
    else
        ((count += $(grep -o 1 <<< "$line" | wc -l)))
    fi
done < fasta_file
if [[ -n $count ]]; then
    printf "%d\n" $count
fi
Or in pure bash using parameter expansion:
while read -r line
do
    if [[ ${line:0:1} == '>' ]]; then
        if [[ -n $count ]]; then
            printf "%d\n" $count
        fi
        count=0
        echo "$line"
    else
        line="${line//[^1]/}"   # remove everything but 1's
        ((count += ${#line}))   # add the length of line to count
    fi
done < fasta_file
if [[ -n $count ]]; then
    printf "%d\n" $count
fi
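The parameter-expansion step can be verified on its own; deleting every non-1 character leaves a string whose length is the count:

```shell
# ${var//pattern/} deletes every match of pattern (a bash extension);
# stripping all characters except "1" leaves only the 1s.
line="1100101010"
line="${line//[^1]/}"    # line is now "11111"
echo "${#line}"          # prints 5
```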
A similar setup in perl:
open my $fh, '<', 'fasta_file' or die "$!";
my $count = -1;
while (<$fh>) {
    if (/^>/) {
        print "$count\n" unless $count == -1;
        $count = 0;
        print;
    } else {
        $count += tr/1//;
    }
}
print "$count\n" unless $count == -1;
close $fh;
The transliteration operator, tr///, returns the number of transliterations it performed, and since 1 is the only argument, that is the same as counting 1s.
Upvotes: 7
Reputation: 386216
>111
11001010101110000001
can also be written as
>111
1100101010
1110000001
but none of the existing solutions work for the latter. This addresses that oversight:
perl -Mv5.10 -ne'
if ( /^>/ ) {
say $c if defined $c;
$c = 0;
print;
} else {
$c += tr/1//;
}
END {
say $c if defined $c;
}
' file.fasta
For both files shown above, the program outputs
>111
9
Upvotes: 6
Reputation: 133590
With your shown samples, please try the following awk code. It is written and tested in GNU awk, and should work in any awk.
awk '/^>/{print;next} {print (NF?split($0,arr,"1")-1:0)}' Input_file
Explanation: if a line starts with >, print it; next then skips the remaining statements for that line. Otherwise, if NF is non-zero, use the split function to split the current line into the array arr on the delimiter 1; the number of pieces minus 1 gives the count of 1s on the line. If NF is zero (an empty line), print 0 instead.
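split() behaves like field splitting, so the piece count is one more than the number of delimiters:

```shell
# split() returns the number of array elements it produced;
# "1100101010" splits on "1" into 6 pieces, so 6 - 1 = 5 ones.
echo "1100101010" | awk '{ print split($0, arr, "1") - 1 }'    # prints 5
```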
Upvotes: 1
Reputation: 34808
One awk
idea:
awk '
/^>/ { print ; next } # print lines starting with ">"; skip to next input line
{ print gsub(/1/,"x") } # replace all "1" characters with dummy "x"; gsub() returns number of replacements (ie, number of "1" characters in the line)
' file
Or as a one-liner:
awk '/^>/ {print;next} {print gsub(/1/,"x")}' file
Collapsing into a ternary operator to determine what to print:
awk '{print ($0 ~ /^>/ ? $0 : gsub(/1/,"x"))}' file
These all generate:
>111
5
>102
4
Upvotes: 5