andreSmol

Reputation: 1038

Python regex positive look ahead

I have the following regex that is supposed to find sequences of words that end with a punctuation mark. The lookahead ensures that the match is followed by a space and a capital letter or digit.

pat1 = re.compile(r"\w.+?[?.!](?=\s[A-Z\d])")

What is the function of the following lookahead?

pat2 = re.compile(r"\w.+?[?.!](?=\s+[A-Z\d])")

Does Python 3.2 support variable-length lookahead (\s+)? I do not get any error. Furthermore, I cannot see any difference between the two patterns. Both seem to work the same regardless of the number of blanks I have. Is there an explanation for the purpose of the \s+ in the lookahead?

Upvotes: 3

Views: 5736

Answers (2)

FailedDev

Reputation: 26940

The difference is that the first lookahead expects exactly one whitespace character before the digit or capital letter, while the second one expects at least one whitespace character and accepts as many as are there.

The + is called a quantifier. It means one or more repetitions, matching as many as possible.

To recap

\s (Exactly one whitespace character. In your lookahead it fails if there is no whitespace, or more than one, before the capital letter or digit.)
\s+ (At least one whitespace character, possibly more.)
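As a small sketch of that difference, here are just the punctuation character and the two lookaheads, run on a made-up string (not from the question) with two blanks after the first full stop. The variable names are only for illustration.

```python
import re

# Made-up sample: two blanks after "One.", one blank after "Two."
text = "One.  Two. Three"

# Only the punctuation plus each lookahead, so the difference is isolated.
single = re.compile(r"[?.!](?=\s[A-Z\d])")
multi = re.compile(r"[?.!](?=\s+[A-Z\d])")

# Start offsets of the punctuation marks each pattern accepts.
single_hits = [m.start() for m in single.finditer(text)]
multi_hits = [m.start() for m in multi.finditer(text)]

print(single_hits)  # [9]     only the full stop followed by exactly one blank
print(multi_hits)   # [3, 9]  both full stops
```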

Further studying.

I have multiple blanks, the \w.+? continues to match the blanks until the last blank before the capital letter

To answer this comment, please consider:

What does \w.+? actually match?

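A single word character (\w, i.e. [a-zA-Z0-9_] for ASCII text) followed by at least one "any" character (except newline), matched with the lazy quantifier +?. Lazy means it takes as few characters as it can, but it keeps extending whenever the rest of the pattern fails. In your case, when more than one blank follows a punctuation mark, the \s lookahead fails there, so .+? keeps consuming, blanks included, until it reaches a punctuation mark that is followed by exactly one whitespace character and then a capital letter or digit. This is why you see the blanks in your output.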
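As a rough illustration, here is what the two patterns from the question return when several blanks follow a punctuation mark. The sample sentence is made up, not taken from the question.

```python
import re

# Patterns from the question (with the closing parenthesis in place).
pat1 = re.compile(r"\w.+?[?.!](?=\s[A-Z\d])")
pat2 = re.compile(r"\w.+?[?.!](?=\s+[A-Z\d])")

# Made-up sample: three blanks after the first full stop.
text = "Hello world.   Next one. End"

result1 = pat1.findall(text)
result2 = pat2.findall(text)

print(result1)  # ['Hello world.   Next one.']
print(result2)  # ['Hello world.', 'Next one.']
```

With a single \s the first full stop is rejected because more blanks follow it, so the lazy .+? runs on through the blanks to the next full stop. With \s+ each sentence is matched on its own.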

Upvotes: 2

Stefano

Reputation: 18540

I'm not really sure what you are trying to achieve here.

A sequence of words ended by a punctuation mark can be matched with something like:

re.findall(r'([\w\s]*[\?\!\.;])', s)

Is the lookahead there because another string is required to follow?

In any case:

  • \s requires one and only one whitespace character;
  • \s+ requires at least one whitespace character.

And yes, the lookahead accepts the "+" quantifier even in Python 2.x.

The same as before but with a lookahead:

re.findall(r'([\w\s]*[\?\!\.;])(?=\s\w)', s)

or

re.findall(r'([\w\s]*[\?\!\.;])(?=\s+\w)', s)

You can try them all on something like:

s='Stefano ciao.   a domani. a presto;'

Depending on your strings, the lookahead might or might not be necessary, and allowing more than one space with "+" might or might not change the result.
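For reference, a quick sketch that runs the three patterns above on that sample string. The names no_lookahead, one_space and many_spaces are just labels chosen for this example.

```python
import re

s = 'Stefano ciao.   a domani. a presto;'

# No lookahead: every run of words/whitespace ending in punctuation.
no_lookahead = re.findall(r'([\w\s]*[\?\!\.;])', s)

# Lookahead with exactly one whitespace character before a word character.
one_space = re.findall(r'([\w\s]*[\?\!\.;])(?=\s\w)', s)

# Lookahead with one or more whitespace characters before a word character.
many_spaces = re.findall(r'([\w\s]*[\?\!\.;])(?=\s+\w)', s)

print(no_lookahead)  # ['Stefano ciao.', '   a domani.', ' a presto;']
print(one_space)     # ['   a domani.']
print(many_spaces)   # ['Stefano ciao.', '   a domani.']
```

On this string the "+" does change the result: the first sentence is followed by three blanks, so only the \s+ version accepts it. The last one is dropped by both lookaheads because nothing follows it.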

Upvotes: 2
