mnosefish

Reputation: 379

using a variable name as the output of cat

I'm seeing some strange behavior from bash. I have several files, some of which are split into multiple parts. Files named *_Rx_00y.fastq.gz should be concatenated by their x identifier; that is, R1_001 and R1_002 (as well as a hypothetical R1_003) go together.

    [mark@theNosebook Sample_P4]$ ls -lh
    total 822M
    -rwxr-xr-x 1 mark mark 404M Aug 13 12:25 P4_CTCTCTAC-AGAGTAGA_L002_R1_001.fastq.gz
    -rwxr-xr-x 1 mark mark 2.6M Aug 13 12:25 P4_CTCTCTAC-AGAGTAGA_L002_R1_002.fastq.gz
    -rwxr-xr-x 1 mark mark 414M Aug 13 12:25 P4_CTCTCTAC-AGAGTAGA_L002_R2_001.fastq.gz
    -rwxr-xr-x 1 mark mark 2.6M Aug 13 12:25 P4_CTCTCTAC-AGAGTAGA_L002_R2_002.fastq.gz
    -rwxr-xr-x 1 mark mark  144 Aug 13 12:25 SampleSheet.csv

I wish to take both *_R1_00x.fastq.gz files and concatenate them into the first. I realize I could use >> here, but that seems unwieldy if I have more than two entries. My solution, which I think should work, is:

    name=$(ls *_R1_001.fastq.gz)
    cat $(ls *_R1_*) > ${name}

However, this is what I get:

    [mark@theNosebook Sample_P4]$ ls -lh
    total 421M
    -rwxr-xr-x 1 mark mark 2.6M Aug 13 12:37 P4_CTCTCTAC-AGAGTAGA_L002_R1_001.fastq.gz
    -rwxr-xr-x 1 mark mark 2.6M Aug 13 12:25 P4_CTCTCTAC-AGAGTAGA_L002_R1_002.fastq.gz
    -rwxr-xr-x 1 mark mark 414M Aug 13 12:25 P4_CTCTCTAC-AGAGTAGA_L002_R2_001.fastq.gz
    -rwxr-xr-x 1 mark mark 2.6M Aug 13 12:25 P4_CTCTCTAC-AGAGTAGA_L002_R2_002.fastq.gz
    -rwxr-xr-x 1 mark mark  144 Aug 13 12:25 SampleSheet.csv

Note that the resulting file is only as large as the second file (2.6M). Writing them to a separate file (here named cat) works fine:

    [mark@theNosebook Sample_P4]$ cat $(ls *_R1_*) > cat
    [mark@theNosebook Sample_P4]$ ls -lh
    total 1.2G
    -rw-r--r-- 1 mark mark 407M Aug 13 12:36 cat
    -rwxr-xr-x 1 mark mark 404M Aug 13 12:25 P4_CTCTCTAC-AGAGTAGA_L002_R1_001.fastq.gz
    -rwxr-xr-x 1 mark mark 2.6M Aug 13 12:25 P4_CTCTCTAC-AGAGTAGA_L002_R1_002.fastq.gz
    -rwxr-xr-x 1 mark mark 414M Aug 13 12:25 P4_CTCTCTAC-AGAGTAGA_L002_R2_001.fastq.gz
    -rwxr-xr-x 1 mark mark 2.6M Aug 13 12:25 P4_CTCTCTAC-AGAGTAGA_L002_R2_002.fastq.gz
    -rwxr-xr-x 1 mark mark  144 Aug 13 12:25 SampleSheet.csv

What's going on here? I would like to preserve the file names, as they reference the samples from which they were taken.

Thanks

Upvotes: 1

Views: 821

Answers (3)

ghoti

Reputation: 46826

Since you want to preserve filenames, I gather that everything in the filename up to the last underscore is a candidate for preservation, with the last three digits being an incrementing chunk identifier.

As such, you might want to process large quantities of these files, and not have to modify your script for each one.

How about this?

    #!/usr/bin/env bash

    # Detect a "-f" option, which forces recreation of files.
    if [ "$1" = "-f" ]; then
      force=true
    else
      force=false
    fi

    # First, get our list of prefixes into an array,
    # stripping from the last underscore to the end of each name.
    a=(*.fastq.gz)
    prefixes=("${a[@]%_*}")  # the parentheses keep this an array, one prefix per file

    # Next, step through the prefixes array, concatenating the chunks.
    # A prefix appears once per chunk; repeats are skipped by the -s test unless -f is given.
    for prefix in "${prefixes[@]}"; do
      if [ ! -s "${prefix}_joined.fastq.gz" ] || $force; then
        cat "${prefix}"_[0-9]*.fastq.gz > "${prefix}_joined.fastq.gz"
      fi
    done

Note the "-f" option. Without it, the script skips any prefix whose joined file already exists and is non-empty, so re-running it on a large collection quickly passes over files processed during a previous batch; with it, every joined file is recreated.

I recommend joining your files into separate files rather than overwriting your first file, so that if something goes wrong, you haven't corrupted your original data. Results should be reproducible, after all! :-)
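
If it helps, here is a minimal sketch of how a joined file could be checked before the chunks are archived or removed. It relies on gzip accepting several compressed members back to back, which the gzip manual documents. The scratch directory, the s_R1_* sample names and the read1/read2 contents are made up for illustration and are not part of the script above.

```shell
# Scratch directory with hypothetical sample names; nothing outside it is touched.
workdir=$(mktemp -d)
printf 'read1\n' | gzip > "$workdir/s_R1_001.fastq.gz"
printf 'read2\n' | gzip > "$workdir/s_R1_002.fastq.gz"

# Join the chunks into a separate file, as recommended above.
cat "$workdir"/s_R1_[0-9]*.fastq.gz > "$workdir/s_R1_joined.fastq.gz"

# gzip -t tests the integrity of the compressed stream without writing output.
if gzip -t "$workdir/s_R1_joined.fastq.gz"; then
  status=ok
else
  status=corrupt
fi

# Decompressing the joined file yields the chunks' contents in order.
joined_text=$(gzip -dc "$workdir/s_R1_joined.fastq.gz")

printf 'integrity: %s\n' "$status"
printf '%s\n' "$joined_text"
rm -rf "$workdir"
```

On real data, comparing the line count of the joined file against the sum of the chunks would be a reasonable extra check.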

Upvotes: 1

chepner

Reputation: 530970

You don't need to use ls; whatever pattern you use with ls can just as well be used to populate an array, whose contents can then be used as the argument to cat. Write everything to a temp file first, to ensure the concatenation succeeds before overwriting the first file.

    to_cat=( *_R1_* )
    tmp=$(mktemp)
    cat "${to_cat[@]}" > "$tmp" && mv "$tmp" "${to_cat[0]}"

You can optionally ensure that you found files to concatenate. (I'd recommend it, just to be safe.)

    shopt -s nullglob
    to_cat=( *_R1_* )
    tmp=$(mktemp)
    (( ${#to_cat[@]} )) && cat "${to_cat[@]}" > "$tmp" && mv "$tmp" "${to_cat[0]}"
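
As for why the temp file matters: the shell opens and truncates the redirection target before cat starts, so in the original command the first chunk is already empty by the time cat reads it, and only the later chunks end up in the output. The sketch below reproduces that in a scratch directory and then repeats the join through a temp file. The demo_R1_* names and their contents are hypothetical, and it uses positional parameters (set --) in place of the array so that it also runs under plain sh.

```shell
# Scratch directory with hypothetical file names; nothing outside it is touched.
workdir=$(mktemp -d)
printf 'first\n'  > "$workdir/demo_R1_001.txt"
printf 'second\n' > "$workdir/demo_R1_002.txt"

# Unsafe: the shell truncates the redirection target before cat runs,
# so the first chunk is already empty when cat opens it for reading.
# (Some cat implementations also report an error here, hence 2>/dev/null.)
cat "$workdir"/demo_R1_* > "$workdir/demo_R1_001.txt" 2>/dev/null
unsafe_result=$(cat "$workdir/demo_R1_001.txt")

# Restore the first chunk, then join through a temporary file instead.
printf 'first\n' > "$workdir/demo_R1_001.txt"
set -- "$workdir"/demo_R1_*            # $1 is the first chunk, "$@" is all of them
tmp=$(mktemp "$workdir/join.XXXXXX")   # temp file in the same directory as the chunks
cat "$@" > "$tmp" && mv "$tmp" "$1"
safe_result=$(cat "$workdir/demo_R1_001.txt")

printf 'unsafe: %s\n' "$unsafe_result"
printf 'safe:   %s\n' "$safe_result"
rm -rf "$workdir"
```

After the unsafe step the first file holds only the second chunk's content, which matches the 2.6M size seen in the question; after the temp-file step it holds both chunks in order. Creating the temp file next to the chunks is a design choice of this sketch: it keeps the final mv on one filesystem, so the replacement is a rename rather than a copy.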

Upvotes: 1

Diego Torres Milano

Reputation: 69198

You have to gunzip first

Try:

    gunzip -c *_R1_001.fastq.gz | gzip > result.gz

Upvotes: -1
