Annelisa
Annelisa

Reputation: 45

Count reads in fasta file (loop)

I would like to count the number of reads in a fasta file using awk. A read in a fasta file starts with "> NAME" followed with a line and "DNA code". In my case, fasta files contain at least 1 read (thus not empty). I also have a lot of fasta files and I would like to loop this. Therefore I would also like to copy the name of the file into the output and paste the number of reads next to the name of the file in the output file.

Fasta file example (File 1.fasta)

>sequence A
ggtaagtcctctagtacaaacacccccaatattgtgatataattaaaattatattcatat
tctgttgccagaaaaaacacttttaggctatattagagccatcttctttgaagcgttgtc
>sequence B
ggtaagtgctctagtacaaacacccccaatattgtgatataattaaaattatattcatat
tctgttgccagattttacacttttaggctatattagagccatcttctttgaagcgttgtc
tatgcatcgatcgacgactg

Fasta file example (File 2.fasta)

>sequence A
ggtaagtcctctagtacaaacacccccaatattgtgatataattaaaattatattcatat
tctgttgccagaaaaaacacttttaggctatattagagccatcttctttgaagcgttgtc
>sequence B
ggtaagtgctctagtacaaacacccccaatattgtgatataattaaaattatattcatat
tctgttgccagattttacacttttaggctatattagagccatcttctttgaagcgttgtc
tatgcatcgatcgacgactg
>sequence C
ggtaagtgctctagtacaaacacccccaatattgtgatataattaaaattatattcatat
tctgttgccagattttacacttttaggctatattagagccatcttctttgaagcgttgtc
tatgcatcgatcgacgactg

I've tried multiple scripts already

Script 1

#!/bin/bash
for file1 in ~/Test/*.fasta
do
outfile1=${file1}_readcount.txt
awk -F ' ' -v out1=$outfile1 '{
if(NR==1) {lines[0]=lines[0] OFS FILENAME  > out1;
}
if(FNR==NR) {grep -c "^>" $file1 > out1;
}
}'  $file1
done

It didn't give an error but also no output

Script 2

awk '
BEGIN   { OFS="\t" } #output field delimiter is tab

FNR==1  { lines[0]=lines[0] OFS FILENAME } #append filename to header record

FNR==NR {grep -c "^>" FILENAME } # counts number of ">" at the beginning of lines

END     { for (i=0;i<=FNR;i++) #loop through the line numbers
              print lines[i]
        } #printing each line 
' *fasta > countreads.txt

Here I only got the header in the file and thousands of empty rows.

Expected ouput that I would like to get

File1   2
File2   3

Upvotes: 2

Views: 364

Answers (3)

Ed Morton
Ed Morton

Reputation: 203368

Using GNU awk for ENDFILE and gensub():

$ awk -v OFS='\t' '/^>/{cnt++} ENDFILE{print gensub(/\.fasta$/,"",1,FILENAME), cnt+0; cnt=0}' File*.fasta
File1   2
File2   3

Upvotes: 0

jhnc
jhnc

Reputation: 16662

With GNU utilities you could do:

grep -cZ '^>' ~/Test/*.fasta | tr '\0' '\t' >countreads.txt
  • -Z prints a null after the filename instead of a colon
  • tr converts that into a tab

If you know your filenames don't contain colons, this should work with any version:

grep -c '^>' ~/Test/*.fasta | tr : '\t' >countreads.txt

Upvotes: 1

dawg
dawg

Reputation: 103814

If you mean to count each /^>/ as a sequence and then summarize sequence counts by filenames, you can do this in awk:

awk 'FNR==1{if (name) print name, cnt; name=FILENAME; cnt=0}
/^>/{cnt++}
END{print name, cnt}' *.fasta

This also works but the file names may be in a different order than awk read them:

awk '/^>/{file_cnt[FILENAME]++}
END{for (fn in file_cnt) print fn, file_cnt[fn]}' *.fasta

You can also just use grep directly to count the matches:

grep -c "^>" *.fasta # that will be ':' delimited. 
                     # You can use sed, tr, awk to 
                     # change the ':' to a '\t` or whatever.

In summary, use awk if you want to customize what is counted or printed. Use grep with a glob if you just want to count the total regex matches and sumarize by file name.

Upvotes: 3

Related Questions