Reputation: 379
I'm having some strange behavior from bash here. I have several files, some of which are in multiple parts. Each file called *_Rx_00y.fastq.gz should be concatenated with x as an identifier, that is R1_001 and R1_002 (as well as a hypothetical R1_003) go together.
[mark@theNosebook Sample_P4]$ ls -lh
total 822M
-rwxr-xr-x 1 mark mark 404M Aug 13 12:25 P4_CTCTCTAC-AGAGTAGA_L002_R1_001.fastq.gz
-rwxr-xr-x 1 mark mark 2.6M Aug 13 12:25 P4_CTCTCTAC-AGAGTAGA_L002_R1_002.fastq.gz
-rwxr-xr-x 1 mark mark 414M Aug 13 12:25 P4_CTCTCTAC-AGAGTAGA_L002_R2_001.fastq.gz
-rwxr-xr-x 1 mark mark 2.6M Aug 13 12:25 P4_CTCTCTAC-AGAGTAGA_L002_R2_002.fastq.gz
-rwxr-xr-x 1 mark mark 144 Aug 13 12:25 SampleSheet.csv
I wish to take both *_R1_00x.fastq.gz files and concatenate them into the first. I realize I could use >> here, but it seems unwieldy if I have more than 2 entries. My solution, which I think should work, is:
name=$(ls *_R1_001.fastq.gz)
cat $(ls *_R1_*) > ${name}
however, here I get
[mark@theNosebook Sample_P4]$ ls -lh
total 421M
-rwxr-xr-x 1 mark mark 2.6M Aug 13 12:37 P4_CTCTCTAC-AGAGTAGA_L002_R1_001.fastq.gz
-rwxr-xr-x 1 mark mark 2.6M Aug 13 12:25 P4_CTCTCTAC-AGAGTAGA_L002_R1_002.fastq.gz
-rwxr-xr-x 1 mark mark 414M Aug 13 12:25 P4_CTCTCTAC-AGAGTAGA_L002_R2_001.fastq.gz
-rwxr-xr-x 1 mark mark 2.6M Aug 13 12:25 P4_CTCTCTAC-AGAGTAGA_L002_R2_002.fastq.gz
-rwxr-xr-x 1 mark mark 144 Aug 13 12:25 SampleSheet.csv
Note that the size of the resultant output is that of only the second file (2.6M). Writing them to a separate file, here cat, works fine.
[mark@theNosebook Sample_P4]$ cat $(ls *_R1_*) > cat
[mark@theNosebook Sample_P4]$ ls -lh
total 1.2G
-rw-r--r-- 1 mark mark 407M Aug 13 12:36 cat
-rwxr-xr-x 1 mark mark 404M Aug 13 12:25 P4_CTCTCTAC-AGAGTAGA_L002_R1_001.fastq.gz
-rwxr-xr-x 1 mark mark 2.6M Aug 13 12:25 P4_CTCTCTAC-AGAGTAGA_L002_R1_002.fastq.gz
-rwxr-xr-x 1 mark mark 414M Aug 13 12:25 P4_CTCTCTAC-AGAGTAGA_L002_R2_001.fastq.gz
-rwxr-xr-x 1 mark mark 2.6M Aug 13 12:25 P4_CTCTCTAC-AGAGTAGA_L002_R2_002.fastq.gz
-rwxr-xr-x 1 mark mark 144 Aug 13 12:25 SampleSheet.csv
What's going on here? I would like to preserve the file names, as they reference the samples from which they were taken.
Thanks
Upvotes: 1
Views: 821
Reputation: 46826
Since you want to preserve filenames, I gather that everything in the filename up to the last underscore is a candidate for preservation, with those last three digits being an incrementing chunk identifier.
As such, you might want to process large quantities of these files, and not have to modify your script for each one.
How about this?
#!/usr/bin/env bash
# Detect a "-f" option, which forces recreation of files.
if [ "$1" = "-f" ]; then
    force=true
else
    force=false
fi
# First, get our list of prefixes into an array,
# stripping from the last underscore to the end of each name.
a=(*.fastq.gz)
# The parentheses matter: without them this would be one string, not an array.
prefixes=("${a[@]%_*}")
# Next, step through the prefixes array, concatenating the chunks.
for prefix in "${prefixes[@]}"; do
    if [ ! -s "${prefix}_joined.fastq.gz" ] || $force; then
        cat "${prefix}"_[0-9]*.fastq.gz > "${prefix}_joined.fastq.gz"
    fi
done
Note the "-f" option. By default, the script skips any prefix whose joined file already exists and is non-empty, so re-running it on a large collection of files is quick; pass "-f" to force the joined files to be recreated.
I recommend joining your files in separate files rather than overwriting your first file, so that if something goes wrong, you haven't corrupted your original data. Results should be reproducible, after all! :-)
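One detail worth noting: stripping the chunk suffix from every name in a=(*.fastq.gz) yields one entry per chunk, so each prefix appears once for every chunk it has. Below is a minimal sketch of collapsing those into unique prefixes. It assumes the names arrive sorted (as bash glob expansion does), and the file names in the array are shortened, hypothetical stand-ins for the real glob result.

```shell
#!/usr/bin/env bash
# Hypothetical names standing in for: a=(*.fastq.gz)
a=(
    P4_L002_R1_001.fastq.gz
    P4_L002_R1_002.fastq.gz
    P4_L002_R2_001.fastq.gz
    P4_L002_R2_002.fastq.gz
)

prefixes=()
last=""
for name in "${a[@]}"; do
    # Strip from the last underscore to the end of the name.
    prefix=${name%_*}
    # Sorted input keeps duplicates adjacent, so comparing with the
    # previous prefix is enough to skip repeats.
    if [ "$prefix" != "$last" ]; then
        prefixes+=("$prefix")
        last=$prefix
    fi
done

printf '%s\n' "${prefixes[@]}"
```

With unique prefixes, the concatenation loop visits each group of chunks once, which matters mostly when "-f" is given.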
Upvotes: 1
Reputation: 530970
You don't need to use ls; whatever pattern you use with ls can just as well be used to populate an array, whose contents can then be used as the arguments to cat. Write everything to a temp file first, to ensure the concatenation succeeds before overwriting the first file.
to_cat=( *_R1_* )
tmp=$(mktemp)
cat "${to_cat[@]}" > "$tmp" && mv "$tmp" "${to_cat[0]}"
You can optionally ensure that you found files to concatenate. (I'd recommend it, just to be safe.)
shopt -s nullglob
to_cat=( *_R1_*)
tmp=$(mktemp)
(( ${#to_cat[@]} )) && cat "${to_cat[@]}" > "$tmp" && mv "$tmp" "${to_cat[0]}"
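The temp file also addresses the behaviour seen in the question: the shell opens and truncates the redirection target before cat starts, so the first file is already empty by the time cat reads it. The following is a small sketch that reproduces this in a scratch directory; the file names part_001.txt and part_002.txt are made up for the demonstration.

```shell
#!/usr/bin/env bash
# Work in a throwaway directory so no real data is touched.
demo_dir=$(mktemp -d)
first="$demo_dir/part_001.txt"
second="$demo_dir/part_002.txt"

printf 'first\n'  > "$first"
printf 'second\n' > "$second"

# Redirecting onto an input file: the shell truncates it before cat runs,
# so only the second file's contents survive. Some cat implementations
# may also print a warning, which is silenced here.
cat "$first" "$second" > "$first" 2>/dev/null || true
broken=$(cat "$first")

# Restore the first file, then go through a temp file instead.
printf 'first\n' > "$first"
tmp=$(mktemp)
cat "$first" "$second" > "$tmp" && mv "$tmp" "$first"
fixed=$(cat "$first")

printf 'direct redirect:\n%s\n' "$broken"
printf 'via temp file:\n%s\n' "$fixed"

rm -rf "$demo_dir"
```

The direct redirect leaves only the second chunk in the first file, while the temp-file version keeps both, in order.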
Upvotes: 1
Reputation: 69198
You have to gunzip first. Try:
gunzip -c *_R1_*.fastq.gz | gzip > result.gz
Upvotes: -1