sai kumar

Reputation: 33

Regex: include special characters in a pattern for re.finditer

I'm trying to get the start and stop index of a word inside a string using re.finditer. For most words my pattern works fine, but for a word containing special characters my regex raises an error.

Problem:

I tried:

import re
a = " we have c++ and c#"
pattern = ['c#','c++']
regex = re.compile(r'\b(' + '|'.join(pattern) + r')\b')
out = [ (m.start(0), m.end(0)) for m in regex.finditer(a)]

Current Output:

error: multiple repeat at position x

Expected Output :

[(9,12),(17,19)]

In most cases my pattern works fine, but words with special characters cause a problem. I'm not very familiar with regex. Can anyone please help me out? Thanks!

Upvotes: 3

Views: 368

Answers (1)

Anurag Wagh

Reputation: 1086

Code:

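import re
a = " we have c++ and c#"
# escape each word, keep \b in front, and require whitespace or end of string after it
pattern = [ r'\b{}(?=\s|$)'.format(re.escape(s)) for s in ['c#','c++']]
regex = re.compile('|'.join(pattern))
out = [ (m.start(0), m.end(0)) for m in regex.finditer(a)]
# out == [(9, 12), (17, 19)]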

Details:

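The first problem is special characters. '+' is a regex quantifier, so an unescaped 'c++' is not read as literal text (this is what the "multiple repeat" error reports). You can escape the special characters manually:

['c\\+\\+', 'c\\#']

or, more simply, let re.escape do that work for you:

[re.escape(s) for s in ['c++', 'c#']]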
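As a rough illustration of the escaping step, the sketch below escapes the two words from the question and searches for them literally. The names words, escaped, literal_regex and found are only illustrative, and no word-boundary handling is applied yet.

```python
import re

words = ['c#', 'c++']

# re.escape puts a backslash in front of characters that have a special
# meaning in a pattern, so each word is matched as literal text.
escaped = [re.escape(w) for w in words]

# Join the escaped words into one alternation and search the sample string.
literal_regex = re.compile('|'.join(escaped))
found = [m.group(0) for m in literal_regex.finditer(" we have c++ and c#")]

print(escaped)
print(found)
```

This only deals with the special characters; matching whole words still needs the boundary handling discussed next.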

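The second problem is word boundaries. \b does not behave the same way next to special characters as it does around alphanumeric words such as \bfoo\b. In " we have c++ and c#", the final '+' and the space after it are both non-word characters, so there is no \b between them and the trailing \b can never match there.

To quote the Python docs: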

\b word boundary

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string, so the precise set of characters deemed to be alphanumeric depends on the values of the UNICODE and LOCALE flags. For example, r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
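The boundary rule quoted above can be checked directly on the example string. This is a small sketch with illustrative variable names, not part of the original answer.

```python
import re

text = " we have c++ and c#"

# Leading \b matches: the space before 'c' is \W and 'c' is \w.
leading_only = re.findall(r'\bc\+\+', text)

# Trailing \b fails: '+' and the following space are both \W,
# so there is no word boundary between them.
with_trailing = re.findall(r'\bc\+\+\b', text)

# The same happens at the end of the string, because '#' is not a word character.
hash_at_end = re.findall(r'\bc#\b', text)

# A trailing \b does match when a word character follows the '+'.
before_word_char = re.findall(r'\bc\+\+\b', 'c++x')

print(leading_only, with_trailing, hash_at_end, before_word_char)
```

So the trailing \b matches exactly where it is not wanted (inside 'c++x') and fails where it is wanted (before a space or the end of the string).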

To make this work, you can replace the trailing \b with a positive lookahead assertion:

r'\b{}(?=\s|$)'

It requires a whitespace character (\s) or the end of the string ($) right after your word, without including that character in the match.
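Putting the two fixes together, a minimal sketch could wrap the approach in a helper. The function name find_spans is hypothetical, and it assumes every search word starts with a word character, since the pattern still begins with \b.

```python
import re

def find_spans(text, words):
    """Return (start, end) pairs for each occurrence of any word in `words`."""
    # Escape every word so characters such as '+' are matched literally.
    alternation = '|'.join(re.escape(w) for w in words)
    # \b in front, then whitespace or end of string required after the word.
    regex = re.compile(r'\b(?:{})(?=\s|$)'.format(alternation))
    return [(m.start(), m.end()) for m in regex.finditer(text)]

spans = find_spans(" we have c++ and c#", ['c#', 'c++'])
print(spans)  # [(9, 12), (17, 19)]
```

Because of the leading \b, a word that starts with a special character (for example '.net' after a space) would not be found by this sketch, and a word followed directly by punctuation such as 'c++,' is rejected by the lookahead.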

Upvotes: 3
