Darren Haynes

Reputation: 1353

Python - Don't Understand The Returned Results of This Concatenated Regex Pattern

I am a Python newb trying to get a better understanding of regex. Just when I think I've got a good grasp of the basics, something throws me, such as the following:

>>> import re

>>> text = "Some nouns like eggs egg bacon what a lovely donkey"
>>> noun_list = ['eggs', 'bacon', 'donkey', 'dog']
>>> noun_patt = r'\s' + '|'.join(noun_list) + r'\s'
>>> found = re.findall(noun_patt, text)
>>> found
[' eggs', 'bacon', 'donkey']

Since I set the regex pattern to find 'whitespace' + 'pipe joined list of nouns' + 'whitespace' - how come:

' eggs' was found with a space before it but not after it?

'bacon' was found with no space on either side of it?

'donkey' was found with no space on either side of it, even though there is no whitespace after it in the text?

The result I was expecting: [' eggs ', ' bacon ']

I am using Python 2.7

Upvotes: 2

Views: 63

Answers (1)

Martijn Pieters

Reputation: 1123460

You misunderstand the pattern. There is no group around the joined list of nouns, and | has the lowest precedence of the regex operators, so it splits the whole pattern into alternatives: the first \s is part of the eggs option, the bacon and donkey options have no spaces, and the dog option includes the final \s metacharacter.
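To make that concrete, the ungrouped pattern can be written out as its four alternatives. This is only a sketch; the names alternatives and ungrouped are illustrative and not from the question:

```python
import re

text = "Some nouns like eggs egg bacon what a lovely donkey"

# Without a group, | splits the entire pattern, so these are the four
# alternatives the regex engine actually sees (illustrative names).
alternatives = [r'\seggs', r'bacon', r'donkey', r'dog\s']
ungrouped = '|'.join(alternatives)

# This is the same string the question builds with
# r'\s' + '|'.join(noun_list) + r'\s'
print(ungrouped)

found = re.findall(ungrouped, text)
print(found)  # [' eggs', 'bacon', 'donkey']
```

Only the first alternative carries the leading \s and only the last carries the trailing \s, which is why ' eggs' keeps its leading space while 'bacon' and 'donkey' match bare.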

You want to put a group around the nouns to delimit what the | option applies to:

noun_patt = r'\s(?:{})\s'.format('|'.join(noun_list))

The non-capturing group here ((?:...)) puts a limit on what the | options apply to. The \s spaces are now outside of the group and are thus not part of the 4 choices.

You need to use a non-capturing group because if you were to use a regular (capturing) group, .findall() would return just the contents of that group (the noun), not the surrounding spaces.
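A short comparison of the two group types, as a sketch using the same text and nouns as the question (the variable names nouns, capturing and non_capturing are illustrative):

```python
import re

text = "Some nouns like eggs egg bacon what a lovely donkey"
nouns = '|'.join(['eggs', 'bacon', 'donkey', 'dog'])

# With a capturing group, findall() returns only the group's contents.
capturing = re.findall(r'\s(' + nouns + r')\s', text)
print(capturing)  # ['eggs', 'bacon']

# With a non-capturing group, findall() returns the whole match,
# including the whitespace on both sides.
non_capturing = re.findall(r'\s(?:' + nouns + r')\s', text)
print(non_capturing)  # [' eggs ', ' bacon ']
```

In both cases 'donkey' is not matched, because it sits at the end of the text with no whitespace after it.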

Demo:

>>> import re
>>> text = "Some nouns like eggs egg bacon what a lovely donkey"
>>> noun_list = ['eggs', 'bacon', 'donkey', 'dog']
>>> noun_patt = r'\s(?:{})\s'.format('|'.join(noun_list))
>>> re.findall(noun_patt, text)
[' eggs ', ' bacon ']

Now both spaces are part of the output.

Upvotes: 5
