Cryssie

Reputation: 3175

Regex to match certain sentence pattern with Python

I'm trying to find out whether a particular sentence pattern contains an abbreviated word like R.E.M. or CEO. The abbreviated words I am looking for are either capital letters each followed by a period, like R.E.M., or words written in all caps.

# sentence pattern: 'What is/was a/an (optional) word (abbreviated or not)?'
sentence1 = 'What is a CEO'
sentence2 = 'What is a geisha?'
sentence3 = 'What is ``R.E.M.``?'

This is what I have but it's not returning anything at all. It doesn't recognise the pattern. I can't figure out what is wrong with the regex.

c5 = re.compile("^[w|W]hat (is|are|was|were|\'s)( a| an| the)*( \`\`)*( [A-Z\.]+\s)*( \'\')* \?$")
if c5.match(question):
    return "True."

EDIT: I am looking to see if the sentence pattern above has an abbreviated word.

Upvotes: 1

Views: 4734

Answers (4)

Casimir et Hippolyte

Reputation: 89547

You can try this pattern:

c5 = re.compile(r"^[wW]hat (?:is|are|w(?:as|ere)|'s)(?: (?:an?|the))? ([`'\"]*)((?:[A-Z]\.)+|[A-Z]+)\1 ?\??$")

explanations:

I use non-capturing groups (?:..) instead of capturing groups (..), assuming that you don't need to extract their contents (except for the abbreviation).

[w|W] is replaced by [wW] since | in a character class is seen as literal.

To make the different quotes optional around the abbreviation, I put a capture group before it that can be empty: ([`'\"]*), and a backreference after the abbreviation (i.e. \1), so the closing quotes must be identical to the opening ones.

The abbreviation is described as an alternation between (?:[A-Z]\.)+ (one or more uppercase letters, each followed by a dot) and [A-Z]+ (a run of uppercase letters).

The space between the abbreviation and the question mark is optional, and so is the question mark itself (thanks to FooBar for pointing these out).
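To check the explanations above, here is a minimal sketch that runs the pattern against the three sentences from the question and pulls out the abbreviation from group 2. The helper name find_abbreviation and the extra mismatched-quote sentence are only illustrative assumptions, not part of the question.

```python
import re

# Pattern copied from this answer.
c5 = re.compile(
    r"^[wW]hat (?:is|are|w(?:as|ere)|'s)(?: (?:an?|the))? "
    r"([`'\"]*)((?:[A-Z]\.)+|[A-Z]+)\1 ?\??$"
)

def find_abbreviation(question):
    """Return the abbreviation if the question fits the pattern, else None.

    Hypothetical helper, named here for illustration only.
    """
    m = c5.match(question)
    # Group 1 holds the (possibly empty) opening quotes, group 2 the abbreviation.
    return m.group(2) if m else None

sentences = [
    "What is a CEO",
    "What is a geisha?",
    "What is ``R.E.M.``?",
    "What is ``R.E.M.''?",  # closing quotes differ from the opening ones
]
results = {s: find_abbreviation(s) for s in sentences}

for sentence, abbreviation in results.items():
    print(repr(sentence), "->", abbreviation)
```

Because of the \1 backreference, the last sentence is rejected: the closing quotes have to repeat the opening ones exactly. If your data opens with `` and closes with '', you would need two independent quote classes instead of the backreference.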

Upvotes: 0

Alnilam

Reputation: 3391

You've got a few issues. It's not really clear from your examples what sort of quoting to expect, or whether you want to match sentences that don't end in a question mark. Your regex uses * (zero or more of the previous item) where I think you want ? (zero or one). You will also miss sentences with What's, even though I think you want those, because your pattern looks for What 's (with a space) instead.

Here's a possible solution:

 import re
 sentence1 = "What is a CEO"
 sentence2 = "What is a geisha?"
 sentence3 = "What is ``R.E.M.``?"
 sentence4 = "What's SCUBA?"

 c1 = re.compile(r"^[wW]hat(?: is| are| was| were|\'s)(?: a| an| the)? [`']{0,2}((?:[A-Z]\.)+|[A-Z]+)[`']{0,2} ?\??")

 def test(question, regex):
     if regex.match(question):
         return "Matched!"
     else:
         return "Nope!"

 test(sentence1,c1)
 > "Matched!"
 test(sentence2,c1)
 > "Nope!"
 test(sentence3,c1)
 > "Matched!"
 test(sentence4,c1)
 > "Matched!"     

But it could probably be tweaked more depending on whether you expect the abbreviation to be double-quoted, for example.
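One tweak worth considering: c1 has no end-of-string anchor, so it accepts any sentence whose word merely starts with a capital letter. The sketch below compares c1 with a stricter variant; the name c1_anchored is an assumption introduced here, built by appending $ to the same pattern.

```python
import re

# Pattern copied from this answer (no end-of-string anchor).
c1 = re.compile(
    r"^[wW]hat(?: is| are| was| were|\'s)(?: a| an| the)? "
    r"[`']{0,2}((?:[A-Z]\.)+|[A-Z]+)[`']{0,2} ?\??"
)

# Hypothetical stricter variant: the same pattern with "$" appended.
c1_anchored = re.compile(c1.pattern + "$")

capitalised = "What is a Geisha?"

loose = c1.match(capitalised)            # matches, but only captures "G"
strict = c1_anchored.match(capitalised)  # no match: "eisha?" is left over

print(loose.group(1) if loose else None)
print(strict)

# The anchored variant still accepts the abbreviation examples.
samples = ["What is a CEO", "What is a geisha?", "What is ``R.E.M.``?", "What's SCUBA?"]
strict_results = [bool(c1_anchored.match(s)) for s in samples]
print(strict_results)
```

Whether you want the anchor depends on your data: without it, trailing text after the abbreviation is tolerated, which may or may not be what you need.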

Upvotes: 1

ebenpack

Reputation: 458

This should work:

re.compile(r"^[wW]hat (is|are|was|were) ((a|an|the) )*(['\"`]*)([A-Z\.]*)(['\"`]*)\?$")

You can make some/all of the groups non-capturing if necessary, or you can make the terminating question mark optional (I noticed it's missing from one of your examples). There are a few tweaks that could be made here and there, but this pretty much does it.
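As a rough sketch of those tweaks, here is one possible variant with non-capturing groups everywhere except the abbreviation and with an optional trailing question mark. The name tweaked is made up for this example, and the double quote inside the character class is escaped so the string literal stays valid.

```python
import re

# Hypothetical variant of the pattern above: only the abbreviation is
# captured, and the final question mark is optional.
tweaked = re.compile(
    r"^[wW]hat (?:is|are|was|were) (?:(?:a|an|the) )*['\"`]*([A-Z.]*)['\"`]*\??$"
)

def abbreviation_of(question):
    """Return the captured abbreviation, or None if the pattern does not match."""
    m = tweaked.match(question)
    return m.group(1) if m else None

print(abbreviation_of("What is a CEO"))
print(abbreviation_of("What is a geisha?"))
print(abbreviation_of("What is ``R.E.M.``?"))

# [A-Z.]* may match nothing at all, so this degenerate input also matches,
# with an empty string as the "abbreviation".
print(repr(abbreviation_of("What is a ?")))
```

If an empty abbreviation should not count as a match, changing [A-Z.]* to [A-Z.]+ would be one way to require at least one character.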

Upvotes: 0

Jongware

Reputation: 22437

The spaces before and after your abbreviation check are in the wrong positions.

You might also want to check your quote handling. Perhaps it's just an artefact of posting your code here, but there seems to be some confusion with your ' and `'s. Try

['`"]*

instead for both.
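To illustrate the spacing problem, here is a small sketch using a raw-string transcription of the pattern from the question (it should be equivalent to the original; the variable names are mine). The abbreviation group demands a space before it and a whitespace character after it, and the final part demands yet another space before the question mark.

```python
import re

# Raw-string transcription of the pattern from the question.
original = re.compile(
    r"^[w|W]hat (is|are|was|were|'s)( a| an| the)*( ``)*( [A-Z.]+\s)*( '')* \?$"
)

# No space between the opening quotes and R, so the abbreviation group
# cannot match, and there is no space before the question mark either.
quoted = original.match("What is ``R.E.M.``?")

# One space before "?": it is consumed by \s, leaving none for " \?".
one_space = original.match("What is a CEO ?")

# Two spaces before "?": one for \s and one for " \?", so this matches.
two_spaces = original.match("What is a CEO  ?")

# Inside a character class | is a literal, so [w|W] also accepts "|".
pipe = original.match("|hat is a CEO  ?")

print(quoted, one_space, bool(two_spaces), bool(pipe))
```

So the pattern only succeeds on input with two spaces before the question mark, which is why none of your example sentences are recognised.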

Upvotes: 0
