Reputation: 12170

Confusing Behaviour of regex in Python

I'm trying to match a specific pattern using the re module in Python. I want to match a full sentence (more precisely, a sequence of alphanumeric strings separated by spaces and/or punctuation).

E.g.

"This is a regular sentence."
"this is also valid"
"so is This ONE"

I've tried various combinations of regular expressions, but I am unable to grasp how the patterns work, with each expression giving me a different yet inexplicable result (I do admit I am a beginner, but still).

I tried:

"((\w+)(\s?))*"

To the best of my knowledge, this should match one or more alphanumerics greedily, followed by either one or no whitespace character, and then it should match this entire pattern greedily. This is not what it seems to do, so clearly I am wrong, but I would like to know why. (I expected this to return the entire sentence as the result.) The result I get for the first sample string mentioned above is [('sentence', 'sentence', ''), ('', '', ''), ('', '', ''), ('', '', '')].

"(\w+ ?)*"

I'm not even sure how this one should work. The official documentation (Python's help('re')) says that *, +, ? match x or x (greedy) repetitions of the preceding RE. In such a case, is the space alone the preceding RE for '?', or is '\w+ ' the preceding RE? And what will be the preceding RE for the '*' operator? The output I get with this is ['sentence'].

Others such as "(\w+\s?)+)" ; "((\w*)(\s??)) etc., which are basically variations of the same idea: that the sentence is a set of alphanumerics followed by a single/finite number of whitespace characters, and this pattern is repeated over and over.

Can someone tell me where I go wrong and why, and why the above expressions do not work the way I was expecting them to?

P.S. I eventually got "[ \w]+" to work for me, but with this I cannot limit the number of consecutive whitespace characters.

Upvotes: 2

Answers (4)

ChipJust

Reputation: 1416

Maybe this will help:

import re

source = """
This is a regular sentence.
this is also valid
so is This ONE
how about this one  followed by this one
"""

re_sentence = re.compile(r'[^ \n.].*?(\.|\n|  +)')

def main():
    # enumerate() numbers the matches; group(0) is the whole matched sentence
    for i, s in enumerate(re_sentence.finditer(source)):
        print("%d:%s" % (i, s.group(0)))

if __name__ == '__main__':
    main()

I am using alternation in the expression (\.|\n|  +) to describe the end-of-sentence condition. Note the use of two spaces in the third alternative. The second space is followed by the '+' metacharacter, so two or more spaces in a row count as an end of sentence.
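As a rough sketch of what this pattern picks out, the same expression can be run with the alternation made non-capturing, so that re.findall returns whole matches rather than only the terminator. The sample string and the names sentence_re, sample and matches are assumptions made for illustration:

```python
import re

# Same pattern as above, but with a non-capturing group (?:...) so that
# findall() returns the full matched text instead of only the terminator.
sentence_re = re.compile(r'[^ \n.].*?(?:\.|\n|  +)')

# Hypothetical sample: one sentence ending in a period, then a line that
# contains a run of two spaces in the middle.
sample = "This is a regular sentence.\nhow about this one  followed by this one\n"

matches = sentence_re.findall(sample)
for m in matches:
    print(repr(m))  # repr() makes trailing newlines and spaces visible
```

Each match includes its terminator (the period, the newline, or the run of spaces), so it may need stripping before further use.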

Upvotes: 0

Nolen Royalty

Reputation: 18663

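Your reasoning about the regex is correct; your problem comes from using capturing groups with *. When a pattern contains groups, re.findall returns the groups rather than the whole match, and a repeated group keeps only its last repetition. Here's an alternative: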

>>> s="This is a regular sentence."
>>> import re
>>> re.findall(r'\w+\s?', s)
['This ', 'is ', 'a ', 'regular ', 'sentence']

In this case it might make more sense for you to use \b in order to match word boundaries.

>>> re.findall(r'\w+\b', s)
['This', 'is', 'a', 'regular', 'sentence']

Alternatively, you can match the entire sentence via re.match and call group(0) on the resulting match object to get the whole match:

>>> r = r"((\w+)(\s?))*"
>>> s = "This is a regular sentence."
>>> import re
>>> m = re.match(r, s)
>>> m.group(0)
'This is a regular sentence'
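To make the effect of the capturing groups concrete, here is a minimal sketch comparing the original pattern with a non-capturing variant. The variable names captured and whole are illustrative, and the non-capturing pattern is only one possible rewrite, not necessarily the best fit for every input:

```python
import re

s = "This is a regular sentence."

# With capturing groups, findall() returns one tuple of groups per match,
# and each repeated group holds only its final repetition.
captured = re.findall(r"((\w+)(\s?))*", s)
print(captured[0])  # groups left over from the last word of the first match

# A non-capturing group (?:...) repeated with + has no groups to report,
# so findall() returns the whole matched text instead.
whole = re.findall(r"(?:\w+\s?)+", s)
print(whole)
```

Using + instead of * on the outer group also avoids the empty matches that * produces wherever no word starts.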

Upvotes: 4

sean

Reputation: 3985

Why do you want to limit the number of consecutive whitespace characters? A sentence can contain any number of words (sequences of alphanumeric characters) and spaces in a row. What defines a sentence is rather that it ends with a punctuation mark, or more generally with any character that is neither alphanumeric nor whitespace.

([a-zA-Z0-9\s])*

The above regex matches a run of zero or more alphanumeric or whitespace characters. You can refine it to the following, though:

([a-zA-Z0-9])([a-zA-Z0-9\s])*

This simply states that the above sequence must be preceded by an alphanumeric character.
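A small sketch of how the refined expression behaves with re.match, written without the capturing parentheses for brevity. The test strings and the names pattern and m are assumptions for illustration:

```python
import re

# One alphanumeric character, then any mix of alphanumerics and whitespace.
pattern = re.compile(r"[a-zA-Z0-9][a-zA-Z0-9\s]*")

m = pattern.match("This is a regular sentence.")
print(m.group(0))  # everything up to, but not including, the period

# match() anchors at the start of the string, so leading whitespace
# prevents a match because the first character must be alphanumeric.
print(pattern.match("  leading spaces"))
```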

Hope this is what you were looking for.

Upvotes: 0

Alex W

Reputation: 38253

Here's an awesome Regular Expression tutorial website:

http://regexone.com/

Here's a Regular Expression that will match the examples given:

([a-zA-Z0-9,\. ]+)
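As a quick check, the expression can be tried against the three example strings from the question with re.fullmatch, which only succeeds when the entire string matches. The names pattern, examples and results are illustrative, and the last test string is an assumed counterexample:

```python
import re

# Character class from the expression above: letters, digits, commas,
# periods and spaces, one or more times.
pattern = re.compile(r"[a-zA-Z0-9,\. ]+")

examples = [
    "This is a regular sentence.",
    "this is also valid",
    "so is This ONE",
]

# fullmatch() returns a match object only if the whole string fits.
results = [bool(pattern.fullmatch(e)) for e in examples]
print(results)

# A character outside the class, such as '?', makes the full match fail.
print(pattern.fullmatch("Is this valid?"))
```

Note that this character class still accepts any number of consecutive spaces, so it does not address the limit on consecutive whitespace mentioned in the question.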

Upvotes: 3