Reputation: 269
I'm very new to the terminal and bash. This may have been asked before, but I wasn't able to find it, probably because I'm not sure exactly what to search for.
I'm trying to format some files for genetic analysis and while I could write out the following command for every sample file, I know there is a better way:
zcat myfile.fastq.gz | awk 'NR % 8 == 5 || NR % 8 == 6 || NR % 8 == 7 || NR % 8 == 0 {print $0}' | gzip > myfile.2.fastq.gz
zcat myfile.fastq.gz | awk 'NR % 8 == 1 || NR % 8 == 2 || NR % 8 == 3 || NR % 8 == 4 {print $0}' | gzip > myfile.1.fastq.gz
I have the following files:
-bash-3.2$ ls
BB001.fastq BB013.fastq.gz IN014.fastq.gz RV006.fastq.gz SL083.fastq.gz
BB001.fastq.gz BB014.fastq.gz INA01.fastq.gz RV007.fastq.gz SL192.fastq.gz
BB003.fastq.gz BB015.fastq.gz INA02.fastq.gz RV008.fastq.gz SL218.fastq.gz
BB004.fastq.gz IN001.fastq.gz INA03.fastq.gz RV009.fastq.gz SL276.fastq.gz
BB006.fastq.gz IN002.fastq.gz INA04.fastq.gz RV010.fastq.gz SL277.fastq.gz
BB008.fastq.gz IN007.fastq.gz INA05.fastq.gz RV011.fastq.gz SL326.fastq.gz
BB009.fastq.gz IN010.fastq.gz INA1M.fastq.gz RV012.fastq.gz SL392.fastq.gz
BB010.fastq.gz IN011.fastq.gz RV003.fastq.gz SL075.fastq.gz SL393.fastq.gz
BB011.fastq.gz IN012.fastq.gz RV004.fastq.gz SL080.fastq.gz SL395.fastq.gz
BB012.fastq.gz IN013.fastq.gz RV005.fastq.gz SL081.fastq.gz
and I would like to apply the two zcat commands to each file, creating two new files from each one, without writing them out 50 times. I've used for loops in R quite a bit but don't know where to start in bash. I can say in words what I want, and hopefully someone can give me a hand coding it:
for FILENAME.fastq.gz in all files in cd
zcat FILENAME.fastq.gz | awk 'NR % 8 == 5 || NR % 8 == 6 || NR % 8 == 7 || NR % 8 == 0 {print $0}' | gzip > FILENAME.2.fastq.gz
zcat FILENAME.fastq.gz | awk 'NR % 8 == 1 || NR % 8 == 2 || NR % 8 == 3 || NR % 8 == 4 {print $0}' | gzip > FILENAME.1.fastq.gz
Thanks a ton in advance for your help!
*****EDIT*****
My notation was a bit off, here's the final, correct for loop:
for fname in *.fastq.gz
do
gzcat "$fname" | awk 'NR % 8 == 5 || NR % 8 == 6 || NR % 8 == 7 || NR % 8 == 0 {print $0}' | gzip >../../SeparateReads/"${fname%.fastq.gz}.2.fastq.gz"
gzcat "$fname" | awk 'NR % 8 == 1 || NR % 8 == 2 || NR % 8 == 3 || NR % 8 == 4 {print $0}' | gzip >../../SeparateReads/"${fname%.fastq.gz}.1.fastq.gz"
done
*****FOLLOWUP QUESTION*****
When I run the following:
for fname in *.1.fastq.gz
do
cat ./CleanedSeparate/XhoI/"$fname" ./CleanedSeparate/MseI/"${fname%.1.fastq.gz}.2.fastq.gz" > ./FinalCleaned/"${fname%.1.fastq.gz}.fastq.gz"
done
I get this error:
cat: ./CleanedSeparate/XhoI/*.1.fastq.gz: No such file or directory
cat: ./CleanedSeparate/MseI/*.2.fastq.gz: No such file or directory
Obviously I'm not using * correctly. Any tips on where I'm going wrong?
Upvotes: 4
Views: 11704
Reputation: 113844
for fname in *.fastq.gz
do
zcat "$fname" | awk 'NR % 8 == 5 || NR % 8 == 6 || NR % 8 == 7 || NR % 8 == 0 {print $0}' | gzip >"${fname%.fastq.gz}.2.fastq.gz"
zcat "$fname" | awk 'NR % 8 == 1 || NR % 8 == 2 || NR % 8 == 3 || NR % 8 == 4 {print $0}' | gzip >"${fname%.fastq.gz}.1.fastq.gz"
done
Key points:
for fname in *.fastq.gz
This loops over every file in the current directory ending in .fastq.gz. If the files are in a different directory, then use:
for fname in /path/to/*.fastq.gz
where /path/to/ is whatever the path should be to get to those files.
zcat "$fname"
This part is straightforward. It substitutes in the file name as the argument for zcat. The double quotes keep the name intact even if it contains spaces.
"${fname%.fastq.gz}.1.fastq.gz"
This is a little bit trickier. To get the desired output file name, we need to insert the .1 into the original filename. The easiest way to do this in bash is to remove the .fastq.gz suffix from the file name with ${fname%.fastq.gz}, where the % is bash-speak for "remove the shortest match of the pattern that follows from the end of the value". Then, we add on the new suffix .1.fastq.gz and we have the correct file name.
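To see the suffix removal in isolation, here is a minimal sketch that builds the output names without touching any files. The sample name BB001.fastq.gz is taken from the listing in the question; the variable names base, out1, out2, path and nodir are only illustrative.

```shell
# Sample name from the question's directory listing, used only for illustration.
fname="BB001.fastq.gz"

# % removes the shortest match of the pattern from the END of the value.
base="${fname%.fastq.gz}"
out1="${base}.1.fastq.gz"
out2="${base}.2.fastq.gz"

# For comparison, # removes the shortest match from the START of the value.
path="./CleanedSeparate/XhoI/BB001.1.fastq.gz"
nodir="${path#./CleanedSeparate/XhoI/}"

printf '%s\n' "$base" "$out1" "$out2" "$nodir"
```

This prints BB001, BB001.1.fastq.gz, BB001.2.fastq.gz and BB001.1.fastq.gz, one per line.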
As per the follow-up question, this does not work:
for fname in *.1.fastq.gz
do
cat ./CleanedSeparate/XhoI/"$fname" ./CleanedSeparate/MseI/"${fname%.1.fastq.gz}.2.fastq.gz" > ./FinalCleaned/"${fname%.1.fastq.gz}.fastq.gz"
done
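The problem is that, in the for statement, the shell looks for *.1.fastq.gz in the current directory, but the files aren't there. They are in ./CleanedSeparate/XhoI/. Because nothing matches, bash leaves the pattern as the literal string *.1.fastq.gz, which is why it shows up unexpanded in the cat error messages. Instead, run: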
dir1=./CleanedSeparate/XhoI
for fname in "$dir1"/*.1.fastq.gz
do
base=${fname#"$dir1"/}
base=${base%.1.fastq.gz}
echo "base=$base"
cat "$fname" "./CleanedSeparate/MseI/${base}.2.fastq.gz" >"./FinalCleaned/${base}.fastq.gz"
done
Notice here that the for statement is given the correct directory in which to find the files.
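The unmatched-pattern behaviour can be reproduced safely in a scratch directory. The following is a sketch only, assuming bash's default settings (neither nullglob nor failglob enabled); the directory layout and the counters unmatched and matched are made up for the demonstration.

```shell
# Work in a throwaway directory so the demo touches no real data.
tmpdir=$(mktemp -d)
mkdir "$tmpdir/XhoI"
touch "$tmpdir/XhoI/BB001.1.fastq.gz" "$tmpdir/XhoI/BB003.1.fastq.gz"

# Pattern that matches nothing: the loop runs once with the literal pattern.
unmatched=0
for fname in "$tmpdir"/*.1.fastq.gz; do
    [ -e "$fname" ] || { unmatched=$((unmatched + 1)); continue; }
done

# Pattern pointed at the right directory: one iteration per real file.
matched=0
for fname in "$tmpdir"/XhoI/*.1.fastq.gz; do
    [ -e "$fname" ] || continue   # guard in case the directory is empty
    matched=$((matched + 1))
done

echo "unmatched=$unmatched matched=$matched"
rm -rf "$tmpdir"
```

Adding a guard such as `[ -e "$fname" ] || continue` at the top of the loop body is one way to skip the literal pattern instead of passing it on to cat.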
Upvotes: 6
Reputation: 881503
You can use something like:
for fspec in *.fastq.gz ; do
echo "${fspec}"
done
That will simply echo the file being processed, but you can do anything you want to ${fspec}, including using it for a couple of zcat commands.
In order to get the root of the file name (for creating the other files), you can use the pattern deletion feature of bash to remove the trailing bit:
for fspec in *.fastq.gz ; do
froot=${fspec%%.fastq.gz}
echo "Transform ${froot}.fastq.gz into ${froot}.1.fastq.gz"
done
In addition, for your specific need, it appears you want to send the first four lines of an eight-line group to one file and the other four lines to a second file.
I tend to just use sed for simple tasks like that since it's likely to be faster. You can get the first line group (first four lines of the eight) with:
sed -n 'p;n;p;n;p;n;p;n;n;n;n'
and the second (second four lines of the eight) with:
sed -n 'n;n;n;n;p;n;p;n;p;n;p'
using the p (print-current) and n (get-next) commands. The -n option turns off sed's automatic printing, so only the lines that reach a p are output.
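A quick way to check the two sed scripts before running them on real data is to feed them numbered lines. This sketch assumes seq and paste are available; the numbers 1 to 16 stand in for two 8-line groups, and the variable names first and second are illustrative.

```shell
# Sixteen numbered lines stand in for two consecutive 8-line groups.
first=$(seq 1 16 | sed -n 'p;n;p;n;p;n;p;n;n;n;n' | paste -sd' ' -)
second=$(seq 1 16 | sed -n 'n;n;n;n;p;n;p;n;p;n;p' | paste -sd' ' -)

echo "first:  $first"    # lines 1-4 of each group
echo "second: $second"   # lines 5-8 of each group
```

The first script keeps lines 1 2 3 4 9 10 11 12 and the second keeps 5 6 7 8 13 14 15 16.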
Hence the code then becomes something like:
for fsrc in *.fastq.gz ; do
# Derive both output names from the loop variable fsrc.
fdst1="${fsrc%%.fastq.gz}.1.fastq.gz"
fdst2="${fsrc%%.fastq.gz}.2.fastq.gz"
echo "Processing ${fsrc}"
# For each group of 8 lines, fdst1 gets 1-4, fdst2 gets 5-8.
zcat "${fsrc}" | sed -n 'p;n;p;n;p;n;p;n;n;n;n' | gzip >"${fdst1}"
zcat "${fsrc}" | sed -n 'n;n;n;n;p;n;p;n;p;n;p' | gzip >"${fdst2}"
done
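If you would rather stay with awk, the four-way test from the question can be condensed into a single comparison. This is a sketch of an equivalent condition, not something from the original commands; it is checked here against numbered lines from seq, and the variable names are illustrative.

```shell
# (NR - 1) % 8 is 0..3 for lines 1-4 of each group and 4..7 for lines 5-8.
first=$(seq 1 16 | awk '(NR - 1) % 8 < 4' | paste -sd' ' -)
second=$(seq 1 16 | awk '(NR - 1) % 8 >= 4' | paste -sd' ' -)

# The original long form from the question, for comparison.
long=$(seq 1 16 | awk 'NR % 8 == 1 || NR % 8 == 2 || NR % 8 == 3 || NR % 8 == 4 {print $0}' | paste -sd' ' -)

echo "first:  $first"
echo "second: $second"
echo "long:   $long"
```

An awk pattern with no action prints the matching line, so the explicit {print $0} can be dropped.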
Upvotes: 0