NorseCode

Reputation: 121

Python: return the full word and not just a specific part of the string (regular expressions)

I recently started learning Python and I've gotten as "far" as regular expressions. My task seems fairly simple: I need to write a regular expression that returns certain words from a string. The rule is that a word may contain only a single group of vowels. In other words, it's an imperfect but simple regular expression meant to return one-syllable words from a text.

I believe that the regular expression I have written isn't too far off, but I only get parts of the string back, rather than the full word. Example below:

>>> import re

>>> text = "A boy named Sue tried to kill a swamp monkey, but failed miserably. He then cried. Boo hoo."

>>> re.findall("[^aeiou][aeiou]{1,}[^aeiou]", text)
['boy', 'nam', 'Sue ', 'ried', 'to ', 'kil', ' a ', 'wam', 'mon', 'key', 'but', 'fail', 'mis', 'rab', 'He ', 'hen', 'ried', 'Boo ', 'hoo.']

As you can see, the result isn't correct. It just splits the string into pieces that fit my regular expression, rather than returning the word each piece came from. Moreover, some of the returned strings don't even come from words that fit my criteria.

Thanks in advance!

Upvotes: 2

Views: 1221

Answers (1)

Tim Pietzcker

Reputation: 336158

This is a bit complicated (if I understand your requirements):

import re

regex = re.compile(
    r"""\b           # Match the start of a word
    [^\W\d_aeiou]*   # Match any number of letters except vowels
    [aeiou]+         # Match one or more vowels
    [^\W\d_aeiou]*   # Match any number of letters except vowels
    \b               # Match the end of a word""",
    re.VERBOSE|re.IGNORECASE)

You can then use it like this:

>>> regex.findall("A boy named Sue tried to kill a swamp monkey, but failed miserably. He then cried. Boo hoo.")
['A', 'boy', 'Sue', 'tried', 'to', 'kill', 'a', 'swamp', 'but', 'He', 'then', 'cried', 'Boo', 'hoo']
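
To see what the word boundaries and the narrower consonant class change, here is a small sketch that runs the pattern from the question and a one-line form of the pattern above on the last part of the sample text. The names text, loose and strict are made up for this illustration.

```python
import re

text = "He then cried. Boo hoo."

# Pattern from the question: nothing anchors it to whole words, and
# [^aeiou] also matches spaces and punctuation, so fragments come back.
loose = re.findall(r"[^aeiou][aeiou]{1,}[^aeiou]", text)

# One-line form of the verbose pattern above: \b on both sides forces
# a match to cover the whole word or fail.
strict = re.findall(r"\b[^\W\d_aeiou]*[aeiou]+[^\W\d_aeiou]*\b", text, re.IGNORECASE)

print(loose)   # ['He ', 'hen', 'ried', 'Boo ', 'hoo.']
print(strict)  # ['He', 'then', 'cried', 'Boo', 'hoo']
```

The loose results are the same fragments that appear at the end of the output in the question. The strict pattern either matches a complete word or skips it.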

Explanation:

[^\W\d_aeiou] is a bit hard to understand:

  • \w matches any letter, digit or underscore.
  • \W matches any character that \w doesn't match.
  • [^\W] therefore matches the same as \w. But we can now add more characters to this negated character class that should be subtracted from the set of valid characters.
  • [^\W\d_aeiou] therefore matches anything that \w matches, but without the digits, underscore or vowels.
  • The upside of this approach (instead of using [bcdfghjklmnpqrstvwxyz]) is that \w is Unicode-aware (by default in Python 3; in Python 2 only if you add the re.U flag), so it is not limited to ASCII letters.
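
The points above can be checked with a short sketch, assuming Python 3 and str patterns. The names consonant, ascii_only and sample are hypothetical and exist only for this illustration. Note that the vowel list is still plain ASCII, so an accented vowel such as é ends up on the consonant side of the subtracted class.

```python
import re

# Consonants by subtraction: everything \w matches, minus digits,
# underscore and the five ASCII vowels.
consonant = re.compile(r"[^\W\d_aeiou]", re.IGNORECASE)

# Explicit ASCII-only alternative, for comparison.
ascii_only = re.compile(r"[bcdfghjklmnpqrstvwxyz]", re.IGNORECASE)

# The sample is: n-tilde, c-cedilla, 7, underscore, e, k, e-acute.
sample = "\u00f1 \u00e7 7 _ e k \u00e9"

subtracted = consonant.findall(sample)
explicit = ascii_only.findall(sample)

print(subtracted)  # ['ñ', 'ç', 'k', 'é']
print(explicit)    # ['k']
```

If accented vowels should count as vowels, they would have to be added to both [aeiou] and the negated class by hand.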

Upvotes: 5
