Reputation: 43
I'm working with barcoded data and want to combine the fastq files while still being able to tell which barcode each read originally had. To do that, I am trying to rename each read to the name of its file (e.g. barcode01.fastq) with a unique number appended. I want the final product to look something like:
> barcode01_1
AAGTAGCCTTCGTTCAGTTACGTATTG
+
(&)(*),+*)'(5<64?CA?<;A=@D6
> barcode01_2
...
So far I have the following using awk:
find . -type f -printf "/%P\n" | \
while read FILE ;
do
PREFIX=$(echo ">",${FILE##*/}"_");
awk -v PREFIX=$PREFIX '{
if (NR%4 == 1)
print PREFIX, ++i;
else
print $0
}' ${FILE} > ../${FILE}.fastq;
done
This grabs all the fastq files from subdirectories and makes the headers > barcode01_ 1, but I cannot figure out how to get rid of the space. If I remove the comma between PREFIX and ++i:
find . -type f -printf "/%P\n" | \
while read FILE ;
do
PREFIX=$(echo ">",${FILE##*/}"_");
awk -v PREFIX=$PREFIX '{
if (NR%4 == 1)
print PREFIX ++i;
else
print $0
}' ${FILE} > ../${FILE}.fastq;
done
This makes the headers only increasing numbers, without the > barcode01_ portion.
Upvotes: 1
Views: 1361
Reputation: 204446
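Putting ++ right up against the i doesn't mean awk will apply it to i instead of PREFIX. awk reads PREFIX ++i as PREFIX++ followed by i: a post-increment of PREFIX (whose numeric value starts at 0) concatenated with the still-empty i, which is why you only see numbers. Compare: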
$ awk 'BEGIN{PREFIX="foo"; print PREFIX ++i}'
0
with:
$ awk 'BEGIN{PREFIX="foo"; print PREFIX (++i)}'
foo1
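Applied to the renaming task, the parenthesized form might look like the sketch below. It is only an illustration under a few assumptions: the two-record barcode01.fastq it creates in a temporary directory is made-up sample data, and the names workdir, file, name, prefix, n and renamed are hypothetical, not taken from your script.

```shell
# Hypothetical sample input: two 4-line FASTQ records in barcode01.fastq
workdir=$(mktemp -d)
cat > "$workdir/barcode01.fastq" <<'EOF'
@read-a
AAGT
+
(&)(
@read-b
CCGT
+
*),+
EOF

file="$workdir/barcode01.fastq"
name=$(basename "$file" .fastq)   # strips the directory and the .fastq suffix

# Lines 1, 5, 9, ... are headers. The parentheses make ++ bind to n, so the
# prefix is concatenated with the counter and no separator is inserted.
renamed=$(awk -v prefix="> ${name}_" '
    NR % 4 == 1 { print prefix (++n); next }   # replace the header line
                { print }                      # pass the other lines through
' "$file")

printf '%s\n' "$renamed"
rm -rf "$workdir"
```

With that sample file the two header lines come out as > barcode01_1 and > barcode01_2, with the sequence and quality lines unchanged. Quoting "$file" and passing the prefix through -v with quotes also avoids some of the word-splitting problems in the original loop.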
There are several other issues with your script; copy/paste it into http://shellcheck.net to learn about some of them, then post a new question if you'd like help with the rest.
Upvotes: 1