In a fastq file, how do I change the sequence headers to the file name and a unique identifier?

Question

I'm working with barcoded data and want to combine the fastq files while still being able to tell which barcode each read originally had. So I am trying to change the names of the reads to the name of the file (e.g. barcode01.fastq) with a unique number appended to the end. I want the final product to be something like:

> barcode01_1
AAGTAGCCTTCGTTCAGTTACGTATTG
+
(&)(*),+*)'(5<64?CA?<;A=@D6
> barcode01_2
...

So far I have the following using awk:

find . -type f -printf "/%P\n" | \
    while read FILE ; 
        do 
            PREFIX=$(echo ">",${FILE##*/}"_");
            awk -v PREFIX=$PREFIX '{
                if (NR%4 == 1) 
                    print PREFIX, ++i;
                else 
                    print $0
            }' ${FILE} > ../${FILE}.fastq;
    done

This grabs all the fastq files from subdirectories and makes the headers > barcode01_ 1, but I cannot figure out how to get rid of the space. If I remove the comma between PREFIX and ++i:

find . -type f -printf "/%P\n" | \
    while read FILE ; 
        do 
            PREFIX=$(echo ">",${FILE##*/}"_");
            awk -v PREFIX=$PREFIX '{
                if (NR%4 == 1) 
                    print PREFIX ++i;
                else 
                    print $0
            }' ${FILE} > ../${FILE}.fastq;
    done

This makes the headers only increasing numbers without the > barcode01_ portion.

Ed Morton · Accepted Answer

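Putting ++ right up against the i doesn't mean awk will apply it to i instead of PREFIX. Since concatenation has no explicit operator, awk parses PREFIX ++i as (PREFIX++) i, i.e. a post-increment of PREFIX followed by the still-empty i, which is why your headers turn into increasing numbers. Compare: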

$ awk 'BEGIN{PREFIX="foo"; print PREFIX ++i}'
0

with:

$ awk 'BEGIN{PREFIX="foo"; print PREFIX (++i)}'
foo1

There are several other issues with your script; copy/paste it into http://shellcheck.net to learn about some of them, then post a new question if you'd like help with the rest.
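
For the renaming step itself, a minimal sketch using the parenthesized increment could look like the following. The function name rename_reads and the lowercase awk variable prefix are made up for illustration, and it assumes the input file ends in .fastq and has exactly four lines per record. It keeps the > marker from your desired output; standard FASTQ headers start with @, so change the prefix if downstream tools expect that.

```shell
# rename_reads is a hypothetical helper name, not part of the original script.
# It prints the given FASTQ file to stdout with each header line (the 1st line
# of every 4-line record) replaced by ">" + file name + "_" + running counter.
rename_reads() {
    file=$1
    base=${file##*/}       # drop the directory part
    base=${base%.fastq}    # drop the .fastq extension (assumed suffix)
    awk -v prefix=">${base}_" '
        NR % 4 == 1 { print prefix (++i); next }  # parentheses bind ++ to i
        { print }                                 # other lines pass through
    ' "$file"
}
```

Calling it as rename_reads path/to/barcode01.fastq > renamed.fastq should write headers >barcode01_1, >barcode01_2, and so on, leaving the sequence and quality lines untouched.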
