ffledgling

Reputation: 12140

Confusing Behaviour of regex in Python

I'm trying to match a specific pattern using the re module in Python. I wish to match a full sentence (more precisely, alphanumeric string sequences separated by spaces and/or punctuation).

Eg.

I've tried out various combinations of regular expressions, but I am unable to grasp how the patterns work, with each expression giving me a different yet inexplicable result (I do admit I am a beginner, but still).


I've tried:

Can someone tell me where I go wrong and why, and why the above expressions do not work the way I was expecting them to?


P.S. I eventually got "[ \w]+" to work for me, but with this I cannot limit the number of consecutive white-space characters.

Upvotes: 2

Views: 169

Answers (4)

ChipJust

Reputation: 1416

Maybe this will help:

import re

source = """
This is a regular sentence.
this is also valid
so is This ONE
how about this one  followed by this one
"""

re_sentence = re.compile(r'[^ \n.].*?(\.|\n|  +)')

def main():
    i = 0
    for s in re_sentence.finditer(source):
        print("%d:%s" % (i, s.group(0)))
        i += 1

if __name__ == '__main__':
    main()

I am using alternation in the expression (\.|\n|  +) to describe the end-of-sentence condition. Note the use of two spaces in the third alternative. The second space has the '+' meta-character, so that two or more spaces in a row will be an end-of-sentence.
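If you would rather collect the sentences in a list than print the raw matches, here is a minimal sketch of one way to do it. It assumes the same source text and pattern as above; the non-capturing group (?:...) and the strip() call are my additions, not part of the original snippet.

```python
import re

source = """
This is a regular sentence.
this is also valid
so is This ONE
how about this one  followed by this one
"""

# Same pattern as above, but with a non-capturing group for the terminator
# (an assumption of this sketch) so that findall returns whole matches.
re_sentence = re.compile(r'[^ \n.].*?(?:\.|\n|  +)')

# strip() drops the trailing newline or run of spaces that ended each match;
# a trailing full stop is kept.
sentences = [m.strip() for m in re_sentence.findall(source)]

for i, sentence in enumerate(sentences):
    print("%d:%s" % (i, sentence))
```

The line with two spaces in the middle comes out as two separate entries, which is the end-of-sentence behaviour described above.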

Upvotes: 0

Nolen Royalty

Reputation: 18633

Your reasoning about the regex is correct; your problem comes from using capturing groups with *. Here's an alternative:

>>> s="This is a regular sentence."
>>> import re
>>> re.findall(r'\w+\s?', s)
['This ', 'is ', 'a ', 'regular ', 'sentence']

In this case it might make more sense for you to use \b in order to match word boundaries.

>>> re.findall(r'\w+\b', s)
['This', 'is', 'a', 'regular', 'sentence']

Alternatively, you can match the entire sentence via re.match and call group(0) on the returned match object to get the whole match:

>>> r = r"((\w+)(\s?))*"
>>> s = "This is a regular sentence."
>>> import re
>>> m = re.match(r, s)
>>> m.group(0)
'This is a regular sentence'
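To see why capturing groups combined with * can be confusing, here is a small sketch. It relies on two documented behaviours of the re module: findall returns the groups rather than the whole match when the pattern contains capturing groups, and a repeated group only keeps its last repetition. The variable names are just for illustration.

```python
import re

s = "This is a regular sentence."

# With capturing groups, findall returns the groups rather than the whole
# match, and a group under * keeps only its last repetition.
grouped = re.findall(r"((\w+)(\s?))*", s)
print(grouped[0])

# Non-capturing groups (?:...) make findall return the whole match instead.
# Empty matches are filtered out, since * also matches the empty string.
whole = [m for m in re.findall(r"(?:\w+\s?)*", s) if m]
print(whole)
```

So the first call only shows you the last word, while the non-capturing version gives back the sentence you were probably expecting.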

Upvotes: 4

sean

Reputation: 3985

Why do you want to limit the number of consecutive white-space characters? A sentence can contain any number of words (sequences of alphanumeric characters) and spaces in a row; what ends it is a punctuation mark, or more generally any character that is neither alphanumeric nor white space.

([a-zA-Z0-9\s])*

The above regex will match a sentence as a run of alphanumeric characters and white space, repeated zero or more times. You can refine it to the following, though:

([a-zA-Z0-9])([a-zA-Z0-9\s])*

This simply states that the sequence must begin with an alphanumeric character.
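As a rough sketch of how the refined pattern behaves with re.match (the test strings are just samples I picked, not taken from the question):

```python
import re

# The refined pattern: one alphanumeric character, then any mix of
# alphanumerics and white space.
pattern = re.compile(r"([a-zA-Z0-9])([a-zA-Z0-9\s])*")

m = pattern.match("This is a regular sentence.")
print(m.group(0))  # the match stops before the full stop

# re.match only matches at the start of the string, so a leading space
# means there is no match at all.
print(pattern.match(" leading space"))
```

Note that group(0) is used here on purpose: the second capturing group is repeated, so on its own it would only hold the last character it matched.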

Hope this is what you were looking for.

Upvotes: 0

Alex W

Reputation: 38183

Here's an awesome Regular Expression tutorial website:

http://regexone.com/

Here's a Regular Expression that will match the examples given:

([a-zA-Z0-9,\. ]+)
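A quick sketch of how you might try it out with re.findall. The sample text is hypothetical (borrowed from the sentences used in the other answers), so adjust it to your own input.

```python
import re

pattern = re.compile(r"([a-zA-Z0-9,\. ]+)")

# Hypothetical sample text; newlines are not in the character class,
# so each line is matched separately.
text = "This is a regular sentence.\nthis is also valid\nso is This ONE"

matches = pattern.findall(text)
print(matches)
```

Keep in mind that this character class accepts any number of spaces in a row, so it does not limit consecutive white space.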

Upvotes: 3
