Reputation: 397
I created this little Bash script that takes one argument (a filename) and is supposed to respond according to the extension of the file:
#!/bin/bash
fileFormat=${1}
if [[ ${fileFormat} =~ [Ff][Aa]?[Ss]?[Tt]?[Qq]\.?[[:alnum:]]+$ ]]; then
echo "its a FASTQ file";
elif [[ ${fileFormat} =~ [Ss][Aa][Mm] ]]; then
echo "its a SAM file";
else
echo "its not fasta nor sam";
fi
It's run like this:
sh script.sh filename.sam
If it's a fastq (or FASTQ, or fq, or FQ, or fastq.gz (compressed)), I want the script to tell me "it's a fastq". If it's a sam, I want it to tell me it's a sam, and if not, I want it to tell me it's neither sam nor fastq.
THE PROBLEM: when I didn't consider the .gz (compressed) scenario, the script ran well and gave the result I expected, but something goes wrong when I add that last part to account for it (see the third line, the part that says \.?[[:alnum:]]+ ). This part is meant to say "in the filename, after the extension (fastq in this case), there might be a dot plus some word afterwards".
My input is this:
sh script.sh filename.fastq.gz
And it works. But if I put: sh script.sh filename.fastq
It says it's not fastq. I wanted to make that last part optional, but if I add a "?" at the end it doesn't work. Any thoughts? How can I fix that part so it works for both cases? Thanks!
Upvotes: 3
Views: 357
Reputation: 784998
You may use this regex:
fileFormat="$1"
if [[ $fileFormat =~ [Ff]([Aa][Ss][Tt])?[Qq](\.[[:alnum:]]+)?$ ]]; then
echo "its a FASTQ file"
elif [[ $fileFormat =~ [Ss][Aa][Mm]$ ]]; then
echo "its a SAM file"
else
echo "its not fasta nor sam"
fi
Here (\.[[:alnum:]]+)? makes the last group optional; the group is a dot followed by 1+ alphanumeric characters.
When you run it as:
./script.sh filename.fastq
its a FASTQ file
./script.sh fq
its a FASTQ file
./script.sh filename.fastq.gz
its a FASTQ file
./script.sh filename.sam
its a SAM file
./script.sh filename.txt
its not fasta nor sam
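As a variation, here is a minimal sketch that keeps the same optional-suffix group but also requires a literal dot before the extension, so that names which merely end in "fq" or "sam" (for example "jetsam") are not matched. The function name classify and the labels it prints are assumptions made for illustration, not part of the answer above. Note that, unlike the output shown above, a bare "fq" is no longer reported as FASTQ under this stricter pattern.

```bash
#!/usr/bin/env bash
# Sketch only: classify is a hypothetical helper, not from the original answer.
# The regexes are stored in variables so that [[ =~ ]] treats them as patterns.
classify() {
    local name=$1
    # Literal dot, then fq/fastq in any case, then an optional ".suffix" such as ".gz".
    local fastq_re='\.[Ff]([Aa][Ss][Tt])?[Qq](\.[[:alnum:]]+)?$'
    # Literal dot, then sam in any case, at the end of the name.
    local sam_re='\.[Ss][Aa][Mm]$'
    if [[ $name =~ $fastq_re ]]; then
        echo "FASTQ"
    elif [[ $name =~ $sam_re ]]; then
        echo "SAM"
    else
        echo "other"
    fi
}

# Classify every filename passed on the command line.
for f in "$@"; do
    printf '%s: %s\n' "$f" "$(classify "$f")"
done
```

Whether the leading dot should be mandatory depends on how the files are named; if extension-less names such as "fq" must count, the pattern from the answer is the one to keep.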
Upvotes: 4
Reputation: 189327
The immediate problem is that you are requiring at least one [[:alnum:]] character after .fastq. This is easy to fix per se with * instead of +.
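A minimal sketch of that one-character change, applied to the pattern from the question, might look like the following. The helper name is_fastq and the sample filenames are assumptions for illustration. The pattern is otherwise left as loose as the original, so it still accepts odd spellings such as "ftq" and is not tied to a dot before the extension.

```bash
#!/usr/bin/env bash
# Sketch only: is_fastq is a hypothetical helper name.
# Same regex as in the question, with the final + changed to *,
# so zero alphanumeric characters after the extension are allowed.
is_fastq() {
    local re='[Ff][Aa]?[Ss]?[Tt]?[Qq]\.?[[:alnum:]]*$'
    [[ $1 =~ $re ]]
}

# Try the cases mentioned in the question.
for f in filename.fastq filename.fq filename.fastq.gz filename.sam; do
    if is_fastq "$f"; then
        echo "$f: FASTQ"
    else
        echo "$f: not FASTQ"
    fi
done
```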
Regex is not a particularly happy solution to this problem, though.
case $fileFormat in
    *.[Ff][Aa][Ss][Tt][Qq] | *.[Ff][Aa][Ss][Tt][Qq].*)
        echo "$0: $fileFormat is a FASTQ file" >&2 ;;
    *.[Ss][Aa][Mm] )
        echo "$0: $fileFormat is a SAM file" >&2 ;;
esac
is portable all the way back to the original Bourne sh. In Bash 4.x you could lowercase the filename before the comparison so as to simplify the glob patterns.
Notice also how the diagnostics contain the name of the script and print to standard error instead of standard output.
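The lowercasing idea could be sketched roughly as follows, assuming Bash 4.0 or later, where the ${var,,} expansion converts a value to lowercase. The helper name file_kind, its labels, and the extra fq and fallback branches are assumptions added for illustration. The helper prints its label to standard output so it can be captured, while the diagnostic itself still goes to standard error with the script name in front.

```bash
#!/usr/bin/env bash
# Sketch only, assuming Bash >= 4.0 for the ${var,,} lowercase expansion.
# file_kind is a hypothetical helper name.
file_kind() {
    local lower=${1,,}   # lowercase copy of the filename
    case $lower in
        *.fastq | *.fastq.* | *.fq | *.fq.*) echo "fastq" ;;
        *.sam) echo "sam" ;;
        *) echo "unknown" ;;
    esac
}

for f in "$@"; do
    # Diagnostics go to standard error, prefixed with the script name.
    echo "$0: $f is a $(file_kind "$f") file" >&2
done
```

This variant gives up the Bourne sh portability of the case statement above in exchange for shorter patterns.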
Upvotes: 1