msimmer92

Reputation: 397

Regex pattern that recognises file extension in Bash script not accurate to capture compressed files

I created this little Bash script that has one argument (a filename) and the script is supposed to respond according to the extension of the file:

#!/bin/bash

fileFormat=${1}

if [[ ${fileFormat} =~ [Ff][Aa]?[Ss]?[Tt]?[Qq]\.?[[:alnum:]]+$ ]]; then
    echo "its a FASTQ file";
elif [[ ${fileFormat} =~ [Ss][Aa][Mm] ]]; then
    echo "its a SAM file";
else
    echo "its not fasta nor sam";
fi

It's run like this:

sh script.sh filename.sam

If it's a fastq (or FASTQ, or fq, or FQ, or fastq.gz (compressed)), I want the script to tell me "it's a fastq". If it's a sam, I want it to tell me it's a sam, and if it's neither, I want it to tell me it's neither sam nor fastq.

THE PROBLEM: before I considered the .gz (compressed) scenario, the script ran well and gave the result I expected, but something goes wrong when I add the last part of the pattern to account for that situation (see the if line, the part where it says \.?[[:alnum:]]+ ). This part is meant to say "in the filename, after the extension (fastq in this case), there might be a dot plus some word afterwards".

My input is this:

sh script.sh filename.fastq.gz

And it works. But if I put: sh script.sh filename.fastq

It says it's not fastq. I wanted to make that last part optional, but if I add a "?" at the end it doesn't work. Any thoughts? How can I fix that part so that it works for both cases? Thanks!

Upvotes: 3

Views: 357

Answers (2)

anubhava

Reputation: 784998

You may use this regex:

fileFormat="$1"

if [[ $fileFormat =~ [Ff]([Aa][Ss][Tt])?[Qq](\.[[:alnum:]]+)?$ ]]; then
    echo "its a FASTQ file"
elif [[ $fileFormat =~ [Ss][Aa][Mm]$ ]]; then
    echo "its a SAM file"
else
    echo "its not fasta nor sam"
fi

Here (\.[[:alnum:]]+)? makes the last group optional; that group is a dot followed by one or more alphanumeric characters.

When you run it as:

./script.sh filename.fastq
its a FASTQ file

./script.sh fq
its a FASTQ file

./script.sh filename.fastq.gz
its a FASTQ file

./script.sh filename.sam
its a SAM file

./script.sh filename.txt
its not fasta nor sam

Upvotes: 4

tripleee

Reputation: 189327

The immediate problem is that you are requiring at least one [[:alnum:]] character after .fastq. This is easy to fix per se with * instead of +.
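
As a minimal sketch of that one-character fix, assuming the rest of the pattern from the question is kept as it is (the classify function name and the short labels it prints are hypothetical, introduced only for illustration):

```bash
#!/bin/bash
# classify is a hypothetical helper, not part of the original script.
classify() {
    local name=$1
    # '*' instead of '+' allows zero alphanumerics after "fastq"/"fq",
    # so the match may end right at the extension.
    if [[ $name =~ [Ff][Aa]?[Ss]?[Tt]?[Qq]\.?[[:alnum:]]*$ ]]; then
        echo "FASTQ"
    elif [[ $name =~ [Ss][Aa][Mm]$ ]]; then
        echo "SAM"
    else
        echo "other"
    fi
}

classify filename.fastq
classify filename.fastq.gz
classify filename.sam
```

The pattern is still loose: it does not require a dot before the extension, and the separately optional letters also accept spellings such as "fsq", so it can report a false positive on unusual filenames.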

Regex is not a particularly happy solution to this problem, though.

case $fileFormat in
    *.[Ff][Aa][Ss][Tt][Qq] | *.[Ff][Aa][Ss][Tt][Qq].*)
        echo "$0: $fileFormat is a FASTQ file" >&2 ;;
    *.[Ss][Aa][Mm] )
        echo "$0: $fileFormat is a SAM file" >&2 ;;
esac

is portable all the way back to the original Bourne sh. In Bash 4.x you could lowercase the filename before the comparison so as to simplify the glob patterns.

Notice also how the diagnostics contain the name of the script and print to standard error instead of standard output.
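
The lowercasing idea could look roughly like the sketch below. It assumes Bash 4.0 or later for the ${var,,} expansion, so it is no longer portable to plain sh. The classify_ext name, the extra .fq patterns (which the question asks for but the case statement above does not cover), and the short labels are assumptions added for illustration.

```bash
#!/bin/bash
# Requires Bash 4.0+ for ${var,,}; classify_ext is a hypothetical helper.
classify_ext() {
    local lower=${1,,}   # lowercase copy, so the globs need only one case
    case $lower in
        *.fastq | *.fastq.* | *.fq | *.fq.*)
            echo "FASTQ" ;;
        *.sam)
            echo "SAM" ;;
        *)
            echo "other" ;;
    esac
}

classify_ext reads.FASTQ.gz
classify_ext reads.Fq
classify_ext aln.SAM
```

This sketch prints a short label to standard output so that a caller can capture it with $(...); a user-facing diagnostic would still go to standard error as described above.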

Upvotes: 1
