JonnyQuest

Reputation: 21

Bash scripting, regex in if statement

I'm pretty new to bash scripting and regexp and have a question. I want to check whether my variable $name starts with a letter in a-d, e-h, i-l, etc., and do some stuff accordingly. If the string starts with "the." or "The.", it should check the first letter after the period instead.

My problem is that if $name is "the.anchor", both the a-d0-9 test and the q-t test are true. Do you guys have any idea what's wrong?

if [[ $name =~ ^([tT]he\.)?[a-dA-D0-9]+ ]]; then
    do some stuff
fi

if [[ $name =~ ^([tT]he\.)?[e-hE-H]+ ]]; then
    do some stuff
fi

if [[ $name =~ ^([tT]he\.)?[i-lI-L]+ ]]; then
    do some stuff
fi

if [[ $name =~ ^([tT]he\.)?[m-pM-P]+ ]]; then
    do some stuff
fi

if [[ $name =~ ^([tT]he\.)?[q-tQ-T]+ ]]; then
    do some stuff
fi

if [[ $name =~ ^([tT]he\.)?[u-wU-W]+ ]]; then
    do some stuff
fi

if [[ $name =~ ^([tT]he\.)?[x-zX-Z]+ ]]; then
    do some stuff
fi

Thanks in advance!

Upvotes: 2

Views: 295

Answers (3)

John B

Reputation: 3646

I think the ? can be removed if the names you care about always carry the prefix: with the ?, the prefix is optional, so the pattern can also match starting at the very first character. The + matches the preceding item one or more times and is not needed when you only want to test the first letter.

You can do it like this:

if [[ $name =~ ^[tT]he\.[a-dA-D0-9] ]]; then
    do some stuff
fi

The condition will only return true if the first character after ^[tT]he\. is [a-dA-D0-9].

However, I tend to think case is a cleaner solution than if statements when matching lists of characters against variables.

case $name in
    [tT]he\.[a-dA-D0-9]*)
        do some stuff
        ;;
esac
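If the "the." prefix has to stay optional, one way to extend the case approach is to strip the prefix first and then branch on what is left. This is only a minimal sketch under that assumption: the function name bucket_for and the echoed labels are made up for illustration and stand in for "do some stuff".

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the letter group a name falls into.
bucket_for() {
    local name=$1
    # Remove one leading "the." or "The." if present.
    # This is a glob pattern, so the dot is a literal dot.
    name=${name#[tT]he.}
    case $name in
        [a-dA-D0-9]*) echo "a-d" ;;
        [e-hE-H]*)    echo "e-h" ;;
        [i-lI-L]*)    echo "i-l" ;;
        [m-pM-P]*)    echo "m-p" ;;
        [q-tQ-T]*)    echo "q-t" ;;
        [u-wU-W]*)    echo "u-w" ;;
        [x-zX-Z]*)    echo "x-z" ;;
        *)            echo "other" ;;
    esac
}

bucket_for "the.anchor"   # classified by the "a" after the prefix
bucket_for "tiger"        # no prefix, classified by its own first letter
```

Because the prefix is gone before any pattern is tried, the "t" of "the." can never be mistaken for the first letter of the name.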

Upvotes: 0

JonnyQuest

Reputation: 21

I figured out a way to fix my problem by using elif statements and putting the q-t test last.
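A rough sketch of what that elif chain might look like, reusing the patterns from the question. The function name classify and the echoed labels are hypothetical placeholders for "do some stuff"; the actual script was not posted.

```bash
#!/usr/bin/env bash
# Hypothetical helper: the first branch that matches wins.
classify() {
    local name=$1
    if   [[ $name =~ ^([tT]he\.)?[a-dA-D0-9]+ ]]; then echo "a-d"
    elif [[ $name =~ ^([tT]he\.)?[e-hE-H]+ ]]; then echo "e-h"
    elif [[ $name =~ ^([tT]he\.)?[i-lI-L]+ ]]; then echo "i-l"
    elif [[ $name =~ ^([tT]he\.)?[m-pM-P]+ ]]; then echo "m-p"
    elif [[ $name =~ ^([tT]he\.)?[u-wU-W]+ ]]; then echo "u-w"
    elif [[ $name =~ ^([tT]he\.)?[x-zX-Z]+ ]]; then echo "x-z"
    # q-t goes last: it is the only group that can match the "t" of "the."
    elif [[ $name =~ ^([tT]he\.)?[q-tQ-T]+ ]]; then echo "q-t"
    else echo "other"
    fi
}

classify "the.anchor"
classify "the.tiger"
```

The ordering works because every other group gets a chance to match the letter after the prefix before the q-t pattern can fall back to the "t" in "the." itself.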

Upvotes: 0

Boris the Spider

Reputation: 61198

Your first part is optional:

([tT]he\.)?

So the.anchor matches the pattern ^([tT]he\.)?[a-dA-D0-9]+ because the the. matches ^([tT]he\.)? and the a matches [a-dA-D0-9]+. It also matches ^([tT]he\.)?[q-tQ-T]+ because ^([tT]he\.)? is optional and the leading t matches [q-tQ-T]+. Note that not the whole input is consumed by the second pattern; in fact only the first character is grabbed.

You can verify this by having bash echo the match:

echo "${BASH_REMATCH[0]}"

Which should print the.a in the first case (the n after the a is not in [a-dA-D0-9]) and t in the second.

You do not have an end anchor on the pattern so only part of the input needs to be matched. If you made the second pattern ^([tT]he\.)?[q-tQ-T]+$ then it would not match.
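The checks above can be tried out with a small script. This is just an illustrative sketch: show_match is a made-up helper name, and it simply echoes whatever bash stored in BASH_REMATCH[0], or "no match" when the test fails.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the text a pattern matched, or "no match".
show_match() {
    local name=$1 pattern=$2
    # $pattern is left unquoted on purpose so bash treats it as a regex.
    if [[ $name =~ $pattern ]]; then
        echo "${BASH_REMATCH[0]}"
    else
        echo "no match"
    fi
}

name="the.anchor"
# Without an end anchor, only the part the pattern consumed is reported.
show_match "$name" '^([tT]he\.)?[a-dA-D0-9]+'
show_match "$name" '^([tT]he\.)?[q-tQ-T]+'
# With an end anchor the pattern has to account for the whole name.
show_match "$name" '^([tT]he\.)?[q-tQ-T]+$'
```

Keep in mind that adding $ to every pattern changes the question being asked: each pattern would then require all remaining characters of the name to be in its class, not just the first one.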

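Alternatively, in a regex engine that supports possessive quantifiers (PCRE or Java, for example), you could make the first part possessive: ^([tT]he\.)?+. This means that once the engine has matched the first expression it will not give it back. In the latter case ^([tT]he\.)?+ will grab the the. and then not release it when [q-tQ-T]+ fails; this will cause the match to fail. Note that bash's =~ uses POSIX extended regular expressions, which have no possessive quantifiers, so this variant does not apply to [[ ... ]] directly.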
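A similar "consume the prefix once, then look at the next character" effect can be approximated in plain bash by capturing the character that follows the optional prefix and classifying that capture instead of the whole name. The sketch below is an assumption about how one might do it, not something from the question; first_letter is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the character that should drive the classification.
first_letter() {
    # Group 1 is the optional prefix, group 2 the character right after it.
    local re='^([tT]he\.)?(.)'
    if [[ $1 =~ $re ]]; then
        echo "${BASH_REMATCH[2]}"
    fi
}

first_letter "the.anchor"   # character after the "the." prefix
first_letter "tiger"        # no prefix, so the very first character
```

When the prefix is present, the match that includes it is the longer one, so group 2 lands on the letter after the period; the result can then be tested against [a-dA-D0-9], [q-tQ-T] and so on without the prefix getting in the way.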

Upvotes: 2
