Reputation: 20915

How do I separate words using regex in python while considering words with apostrophes?

I tried to pick out standalone m's with a Python regex by using word boundaries and finding them all. These m's should either have whitespace on both sides or begin/end the string:

r = re.compile("\\bm\\b")
re.findall(r, someString)

However, this also matches the m in words like "I'm", because an apostrophe counts as a word boundary. How do I write a regex that doesn't treat apostrophes as word boundaries?

I've tried this:

r = re.compile("(\\sm\\s) | (^m) | (m$)")
re.findall(r, someString)

but that just doesn't match any m. Odd.

Upvotes: 1

Answers (3)

Lynwood Hines

Reputation: 31

falsetru's answer is almost the equivalent of "\b except apostrophes", but not quite. It will still find matches where a boundary is missing. Using one of falsetru's examples:

>>> import re
>>> re.findall(r'(?<=\s)m(?=\s)|^m|m$', "mama")
['m']

It finds 'm', but there is no occurrence of 'm' in 'mama' that would match '\bm\b'. The first 'm' matches '\bm', but that's as close as it gets.

The regex that implements "\b without apostrophes" is shown below:

(?<=\s)m(?=\s)|^m(?=\s)|(?<=\s)m$|^m$

This will find any of the following 4 cases:

'm' with white space before and after
'm' at beginning followed by white space
'm' at end preceded by white space
'm' with nothing preceding or following it (i.e. just literally the string "m")
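
To check the four cases above, here is a small sketch that runs the pattern against a handful of sample strings. The `compact` pattern is my own assumption, not part of the answer: it uses negative lookarounds ("no non-whitespace character on either side") and appears to match the same positions on these samples.

```python
import re

# The four-alternative pattern from the answer above.
explicit = re.compile(r'(?<=\s)m(?=\s)|^m(?=\s)|(?<=\s)m$|^m$')

# Hypothetical shorthand (an assumption, not from the answer):
# "m" with no non-whitespace character immediately before or after it.
compact = re.compile(r'(?<!\S)m(?!\S)')

samples = ["I m a boy", "m is first", "ends with m", "m",
           "mama", "I'm a boy", "pm", "m m m"]

# Collect the matches of the explicit pattern for each sample string.
results = {s: explicit.findall(s) for s in samples}

for s in samples:
    # Print both results side by side so any disagreement is visible.
    print(repr(s), results[s], compact.findall(s))
```

The first four samples correspond to the four cases in the list; "mama", "I'm a boy" and "pm" should produce no match, which is the behaviour the question asks for.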

Upvotes: 1

beroe

Reputation: 12316

You don't even need lookaround (unless you want to capture the m without the surrounding spaces); your second example was very close. The problem was the extra spaces around each |. Spaces are fine in Python code, but inside a regex they are literal characters that have to match, which is why the pattern failed:

>>> re.findall(r'\sm\s|^m|m$', "I m a boy")
[' m ']
>>> re.findall(r'\sm\s|^m|m$', "mamam")
['m', 'm']
>>> re.findall(r'\sm\s|^m|m$', "mama")
['m']
>>> re.findall(r'\sm\s|^m|m$', "I'm a boy")
[]
>>> re.findall(r'\sm\s|^m|m$', "I'm a boym")
['m']
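
As a side note on the literal-space issue, here is a minimal sketch showing the same spaced-out pattern with and without the `re.VERBOSE` flag, which tells the regex engine to ignore unescaped whitespace in the pattern. I've swapped the capturing groups for non-capturing ones (an assumption about what is wanted), because with several capturing groups `re.findall` returns tuples rather than plain strings.

```python
import re

text = "I m a boy"

# Without a flag, the spaces around each | are literal characters
# that must appear in the input, so nothing matches here.
spaced = re.findall(r"(?:\sm\s) | (?:^m) | (?:m$)", text)

# With re.VERBOSE, unescaped whitespace in the pattern is ignored,
# so the pattern behaves like \sm\s|^m|m$.
verbose = re.findall(r"(?:\sm\s) | (?:^m) | (?:m$)", text, flags=re.VERBOSE)

print(spaced)
print(verbose)
```

If you like the spaced layout for readability, the flag is one way to keep it; otherwise removing the spaces, as in the examples above, is simpler.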

Upvotes: 1

falsetru

Reputation: 369224

Using lookaround assertions:

>>> import re
>>> re.findall(r'(?<=\s)m(?=\s)|^m|m$', "I'm a boy")
[]
>>> re.findall(r'(?<=\s)m(?=\s)|^m|m$', "I m a boy")
['m']
>>> re.findall(r'(?<=\s)m(?=\s)|^m|m$', "mama")
['m']
>>> re.findall(r'(?<=\s)m(?=\s)|^m|m$', "pm")
['m']

(?=...)

Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.

(?<=...)

Matches if the current position in the string is preceded by a match for ... that ends at the current position. This is called a positive lookbehind assertion. (?<=abc)def will find a match in abcdef, ...

from Regular expression syntax
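
The "doesn't consume any of the string" part matters when two candidate m's share a single space. A small sketch, using a made-up sample string, comparing a consuming pattern with the lookaround version:

```python
import re

# Hypothetical input: two standalone m's separated by a single space.
text = "a m m b"

# \s consumes the whitespace, so the space after the first m is used up
# and is no longer available to precede the second m.
consuming = re.findall(r'\sm\s', text)

# Lookarounds only check the neighbours without consuming them,
# so both m's can use the shared space.
lookaround = re.findall(r'(?<=\s)m(?=\s)', text)

print(consuming)
print(lookaround)
```

The consuming pattern reports one match that includes the surrounding spaces, while the lookaround pattern reports both m's and returns them without the spaces.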

BTW, if you use a raw string (r'this is raw string'), you don't need to escape \.

>>> r'\s' == '\\s'
True

Upvotes: 3
