stewart6

Reputation: 269

Running zcat on multiple files using a for loop

I'm very new to the terminal and bash. This may have been asked before, but I couldn't find it, probably because I'm not sure what to search for.

I'm trying to format some files for genetic analysis and while I could write out the following command for every sample file, I know there is a better way:

zcat myfile.fastq.gz | awk 'NR % 8 == 5 || NR % 8 == 6 || NR % 8 == 7 || NR % 8 == 0 {print $0}' | gzip > myfile.2.fastq.gz
zcat myfile.fastq.gz | awk 'NR % 8 == 1 || NR % 8 == 2 || NR % 8 == 3 || NR % 8 == 4 {print $0}' | gzip > myfile.1.fastq.gz

I have the following files:

-bash-3.2$ ls
BB001.fastq BB013.fastq.gz  IN014.fastq.gz  RV006.fastq.gz  SL083.fastq.gz
BB001.fastq.gz  BB014.fastq.gz  INA01.fastq.gz  RV007.fastq.gz  SL192.fastq.gz
BB003.fastq.gz  BB015.fastq.gz  INA02.fastq.gz  RV008.fastq.gz  SL218.fastq.gz
BB004.fastq.gz  IN001.fastq.gz  INA03.fastq.gz  RV009.fastq.gz  SL276.fastq.gz
BB006.fastq.gz  IN002.fastq.gz  INA04.fastq.gz  RV010.fastq.gz  SL277.fastq.gz
BB008.fastq.gz  IN007.fastq.gz  INA05.fastq.gz  RV011.fastq.gz  SL326.fastq.gz
BB009.fastq.gz  IN010.fastq.gz  INA1M.fastq.gz  RV012.fastq.gz  SL392.fastq.gz
BB010.fastq.gz  IN011.fastq.gz  RV003.fastq.gz  SL075.fastq.gz  SL393.fastq.gz
BB011.fastq.gz  IN012.fastq.gz  RV004.fastq.gz  SL080.fastq.gz  SL395.fastq.gz
BB012.fastq.gz  IN013.fastq.gz  RV005.fastq.gz  SL081.fastq.gz

and I would like to run both zcat pipelines on each file, creating two new files from each one, without writing the commands out 50 times. I've used for loops in R quite a bit but don't know where to start in bash. I can say in words what I want, and hopefully someone can give me a hand coding it:

for FILENAME.fastq.gz in all files in cd

zcat FILENAME.fastq.gz | awk 'NR % 8 == 5 || NR % 8 == 6 || NR % 8 == 7 || NR % 8 == 0 {print $0}' | gzip > FILENAME.2.fastq.gz
zcat FILENAME.fastq.gz | awk 'NR % 8 == 1 || NR % 8 == 2 || NR % 8 == 3 || NR % 8 == 4 {print $0}' | gzip > FILENAME.1.fastq.gz

Thanks a ton in advance for your help!

*****EDIT*****

My notation was a bit off. Here's the final, correct for loop:

for fname in *.fastq.gz
do
    gzcat "$fname" | awk 'NR % 8 == 5 || NR % 8 == 6 || NR % 8 == 7 || NR % 8 == 0 {print $0}' | gzip >../../SeparateReads/"${fname%.fastq.gz}.2.fastq.gz"
    gzcat "$fname" | awk 'NR % 8 == 1 || NR % 8 == 2 || NR % 8 == 3 || NR % 8 == 4 {print $0}' | gzip >../../SeparateReads/"${fname%.fastq.gz}.1.fastq.gz"
done

*****FOLLOWUP QUESTION*****

When I run the following:

for fname in *.1.fastq.gz
do
cat ./CleanedSeparate/XhoI/"$fname" ./CleanedSeparate/MseI/"${fname%.1.fastq.gz}.2.fastq.gz" > ./FinalCleaned/"${fname%.1.fastq.gz}.fastq.gz"
done

I get this error:

cat: ./CleanedSeparate/XhoI/*.1.fastq.gz: No such file or directory
cat: ./CleanedSeparate/MseI/*.2.fastq.gz: No such file or directory

Obviously I'm not using * correctly. Any tips on where I'm going wrong?

Upvotes: 4

Views: 11704

Answers (2)

John1024

Reputation: 113844

for fname in *.fastq.gz
do
    zcat "$fname" | awk 'NR % 8 == 5 || NR % 8 == 6 || NR % 8 == 7 || NR % 8 == 0 {print $0}' | gzip >"${fname%.fastq.gz}.2.fastq.gz"
    zcat "$fname" | awk 'NR % 8 == 1 || NR % 8 == 2 || NR % 8 == 3 || NR % 8 == 4 {print $0}' | gzip >"${fname%.fastq.gz}.1.fastq.gz"
done

Key points:

  • for fname in *.fastq.gz

    This loops over every file in the current directory ending in .fastq.gz. If the files are in a different directory, then use:

    for fname in /path/to/*.fastq.gz
    

    where /path/to/ is whatever the path should be to get to those files.

  • zcat "$fname"

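    This part is straightforward. It substitutes the file name in as the argument for zcat. The double quotes keep a name that contains spaces or other special characters as a single argument.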

  • "${fname%.fastq.gz}.1.fastq.gz"

    This is a little bit trickier. To get the desired output file name, we need to insert the .1 into the original filename. The easiest way to do this in bash is to remove the .fastq.gz suffix from the file name with ${fname%.fastq.gz}, where % tells bash to remove the shortest match of the pattern that follows from the end of the value. Then, we add on the new suffix .1.fastq.gz and we have the correct file name.
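The suffix removal can be checked on its own at the prompt, without touching any files. A minimal sketch, using one sample name from the question (the variable names base, out1 and out2 are only for illustration):

```shell
fname=BB003.fastq.gz

# ${fname%.fastq.gz} strips the shortest match of ".fastq.gz" from the end.
base=${fname%.fastq.gz}

# Append the new suffixes to build the two output names.
out1="${base}.1.fastq.gz"
out2="${base}.2.fastq.gz"

echo "$base"   # BB003
echo "$out1"   # BB003.1.fastq.gz
echo "$out2"   # BB003.2.fastq.gz
```

Echoing the names first like this is a cheap way to confirm the output paths before any gzip command writes to them.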

Creating the new files in a different directory

As per the follow-up question, this does not work:

for fname in *.1.fastq.gz
do
    cat ./CleanedSeparate/XhoI/"$fname" ./CleanedSeparate/MseI/"${fname%.1.fastq.gz}.2.fastq.gz" > ./FinalCleaned/"${fname%.1.fastq.gz}.fastq.gz"
done

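The problem is that, in the for statement, the shell looks for *.1.fastq.gz in the current directory, but the files aren't there. They are in ./CleanedSeparate/XhoI/. Since nothing matches, the pattern is passed through unchanged, which is why a literal *.1.fastq.gz shows up in the error message. Instead, run: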

dir1=./CleanedSeparate/XhoI
for fname in "$dir1"/*.1.fastq.gz
do
    base=${fname#"$dir1"/}
    base=${base%.1.fastq.gz}
    echo "base=$base"
    cat "$fname" "./CleanedSeparate/MseI/${base}.2.fastq.gz" >"./FinalCleaned/${base}.fastq.gz"
done

Notice here that the for statement is given the correct directory in which to find the files.
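When a directory might be empty, it may be worth guarding the loop, since an unmatched pattern still produces one iteration. A minimal sketch, assuming a made-up empty scratch directory created with mktemp; the counters processed and skipped are only there to show what happened:

```shell
tmpdir=$(mktemp -d)   # empty scratch directory, so the pattern matches nothing

processed=0
skipped=0
for fname in "$tmpdir"/*.1.fastq.gz; do
    # With no match, fname holds the literal pattern, which is not a real file.
    if [ ! -e "$fname" ]; then
        skipped=$((skipped + 1))
        continue
    fi
    processed=$((processed + 1))
done

echo "processed=$processed skipped=$skipped"
rmdir "$tmpdir"
```

In bash, running shopt -s nullglob before the loop is an alternative: unmatched patterns then expand to nothing and the loop body never runs.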

Upvotes: 6

paxdiablo

Reputation: 881503

You can use something like:

for fspec in *.fastq.gz ; do
    echo "${fspec}"
done

That will simply echo the file being processed but you can do anything you want to ${fspec}, including using it for a couple of zcat commands.


In order to get the root of the file name (for creating the other files), you can use the pattern deletion feature of bash to remove the trailing bit:

for fspec in *.fastq.gz ; do
    froot=${fspec%%.fastq.gz}
    echo "Transform ${froot}.fastq.gz into ${froot}.1.fastq.gz"
done

In addition, for your specific need, it appears you want to send the first four lines of an eight-line group to one file and the other four lines to a second file.

I tend to just use sed for simple tasks like that since it's likely to be faster. You can get the first line group (first four lines of the eight) with:

sed -n 'p;n;p;n;p;n;p;n;n;n;n'

and the second (second four lines of the eight) with:

sed -n 'n;n;n;n;p;n;p;n;p;n;p'

using the p print-current and n get-next commands.
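The two sed scripts can be tried on numbered lines before pointing them at real data. A minimal sketch, assuming seq and paste are available; sixteen numbers stand in for two eight-line groups, and the variable names first and second are only for illustration:

```shell
# Join the selected lines with spaces so each result fits on one line.
first=$(seq 1 16 | sed -n 'p;n;p;n;p;n;p;n;n;n;n' | paste -sd' ' -)
second=$(seq 1 16 | sed -n 'n;n;n;n;p;n;p;n;p;n;p' | paste -sd' ' -)

echo "first:  $first"    # first:  1 2 3 4 9 10 11 12
echo "second: $second"   # second: 5 6 7 8 13 14 15 16
```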

Hence the code then becomes something like:

for fsrc in *.fastq.gz ; do
    fdst1="${fsrc%%.fastq.gz}.1.fastq.gz"
    fdst2="${fsrc%%.fastq.gz}.2.fastq.gz"
    echo "Processing ${fsrc}"

    # For each group of 8 lines, fdst1 gets 1-4, fdst2 gets 5-8.
    zcat "${fsrc}" | sed -n 'p;n;p;n;p;n;p;n;n;n;n' | gzip >"${fdst1}"
    zcat "${fsrc}" | sed -n 'n;n;n;n;p;n;p;n;p;n;p' | gzip >"${fdst2}"
done
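Before running a loop like this on real data, it can be tried in a scratch directory. A minimal sketch, assuming a made-up sample.fastq.gz that holds sixteen numbered lines in place of real reads. It uses gzip -dc rather than zcat, since on some systems zcat only handles .Z files (presumably why the question's edit uses gzcat):

```shell
workdir=$(mktemp -d)
seq 1 16 | gzip > "$workdir/sample.fastq.gz"   # stand-in for a real input file

for fsrc in "$workdir"/*.fastq.gz; do
    fdst1="${fsrc%.fastq.gz}.1.fastq.gz"
    fdst2="${fsrc%.fastq.gz}.2.fastq.gz"

    # For each group of 8 lines, fdst1 gets 1-4, fdst2 gets 5-8.
    gzip -dc "$fsrc" | sed -n 'p;n;p;n;p;n;p;n;n;n;n' | gzip > "$fdst1"
    gzip -dc "$fsrc" | sed -n 'n;n;n;n;p;n;p;n;p;n;p' | gzip > "$fdst2"
done

# Read the results back to confirm the split.
half1=$(gzip -dc "$workdir/sample.1.fastq.gz" | paste -sd' ' -)
half2=$(gzip -dc "$workdir/sample.2.fastq.gz" | paste -sd' ' -)
echo "$half1"   # 1 2 3 4 9 10 11 12
echo "$half2"   # 5 6 7 8 13 14 15 16
rm -r "$workdir"
```

Note that the output names also match *.fastq.gz, so a second run in the same directory would pick them up as inputs. Writing the outputs to a separate directory, as the question's edit does, avoids that.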

Upvotes: 0
