Reputation: 33
I'm trying to get the start and end index of a word inside a string using re.finditer. For most words my pattern works fine, but for a word containing special characters my regex gives me an error.
Problem:
I tried:
import re
a = " we have c++ and c#"
pattern = ['c#','c++']
regex = re.compile(r'\b(' + '|'.join(pattern) + r')\b')
out = [ (m.start(0), m.end(0)) for m in regex.finditer(a)]
Current Output:
error: multiple repeat at position x
Expected Output :
[(9,12),(17,19)]
My pattern works fine in most cases, but I have a problem with words that contain special characters. I'm not very familiar with regex; any help would be appreciated. Thanks!
Upvotes: 3
Views: 368
Reputation: 1086
Code:
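import re
a = " we have c++ and c#"
# Escape each word, then require a word boundary before it and whitespace or end of string after it
pattern = [ r'\b{}(?=\s|$)'.format(re.escape(s)) for s in ['c#','c++']]
regex = re.compile('|'.join(pattern))
out = [ (m.start(0), m.end(0)) for m in regex.finditer(a)]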
Details:
The first problem is special characters: + is a regex quantifier, so it has to be escaped to match literally. You can escape special characters manually
['c\\+\\+', 'c\\#']
or, to simplify, you can use re.escape, which does that work for you
re.escape('c++')
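As a minimal sketch of the escaping step, the snippet below escapes each word with re.escape and joins the results into one alternation. The names words, escaped and spans are illustrative and not part of the original answer. Note that on newer Python versions an unescaped ++ may be parsed as a possessive quantifier instead of raising an error; either way it is not treated as literal text.

```python
import re

words = ['c#', 'c++']

# re.escape puts a backslash in front of regex metacharacters such as '+'
escaped = [re.escape(w) for w in words]

# The escaped alternation matches the literal text of each word
regex = re.compile('|'.join(escaped))
spans = [(m.start(), m.end()) for m in regex.finditer(" we have c++ and c#")]
print(spans)
```

No word boundaries are used here, so this would also match c++ inside a longer token; it only demonstrates the escaping.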
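The second problem is word boundaries: \b does not behave the same way next to special characters as it does next to alphanumeric characters (e.g. \bfoo\b). After c++ or c#, a trailing \b only matches if a word character follows, so it fails before whitespace or at the end of the string.
To quote the Python docs: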
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string, so the precise set of characters deemed to be alphanumeric depends on the values of the UNICODE and LOCALE flags. For example, r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
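The boundary behaviour described in the quote can be checked directly. This is a small sketch using the string from the question; the variable names are illustrative only.

```python
import re

text = " we have c++ and c#"

# A trailing \b after '+' needs a word character next; a space follows here, so there is no match
with_boundary = re.search(r'\bc\+\+\b', text)

# Dropping the trailing \b lets the literal text match
without_boundary = re.search(r'\bc\+\+', text)

# The trailing \b does hold when a word character follows the '+', as in "c++x"
followed_by_word = re.search(r'\bc\+\+\b', "c++x")

print(with_boundary, without_boundary.span(), followed_by_word.span())
```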
To make this work, you can use a positive lookahead assertion
r'\b{}(?=\s|$)'
It looks for a whitespace character (\s) or the end of the string ($) after your pattern, without including it in the match.
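Putting the two fixes together, here is a sketch of a reusable helper built on the same pattern. The function name find_word_spans is hypothetical and introduced only for illustration.

```python
import re

def find_word_spans(text, words):
    """Return (start, end) spans of each literal word found in text."""
    # Escape each word, require a word boundary before it and
    # whitespace or end of string after it (the lookahead is not part of the match)
    parts = [r'\b{}(?=\s|$)'.format(re.escape(w)) for w in words]
    regex = re.compile('|'.join(parts))
    return [(m.start(), m.end()) for m in regex.finditer(text)]

spans = find_word_spans(" we have c++ and c#", ['c#', 'c++'])
print(spans)

# A word followed directly by punctuation is not matched by this pattern
no_match = find_word_spans("I like c++, really", ['c++'])
print(no_match)
```

One consequence of this design is that a word followed directly by punctuation, such as "c++,", is skipped, because the lookahead only accepts whitespace or the end of the string. If that matters, the lookahead would need to allow those characters as well.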
Upvotes: 3