Reputation: 1906
I am trying to match words from a sentence, excluding the ones that start with @.
The original pattern, which does not exclude the words starting with @, is the following:
>>> import re
>>> token_pattern_o='(?u)\\b\\w\\w+\\b'
>>> re.search(token_pattern_o, "@mutt")
<re.Match object; span=(1, 5), match='mutt'>
Now I am just adding a negative lookahead for the exclusion:
>>> token_pattern = '(?u)\\b^(?!@)\\w\\w+\\b'
>>> re.search(token_pattern, "#mutt")
>>> re.search(token_pattern, "@mutt")
>>> re.search(token_pattern, "mutt")
<re.Match object; span=(0, 4), match='mutt'>
>>> re.search(token_pattern, "_mutt")
<re.Match object; span=(0, 5), match='_mutt'>
The issue is that it now excludes every word starting with any special character, not just @.
Is there a way to achieve what I am trying to achieve?
Upvotes: 0
Views: 91
Reputation: 163267
Another option is to match a single word character and assert that what is on the left is not an @.
If that is the case, match 1+ word chars, and use word boundaries at the beginning and the end of the pattern.
(?u)\b\w(?<!@\w)\w+\b
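As a quick check, the pattern above can be run against the inputs from the question. This is only a minimal sketch: the name `pattern` and the sample sentence are made up for illustration, and the pattern is written as a raw string so the backslashes do not need doubling.

```python
import re

# Pattern from this answer, written as a raw string.
pattern = re.compile(r'(?u)\b\w(?<!@\w)\w+\b')

# Inputs taken from the question.
for text in ["mutt", "@mutt", "#mutt", "_mutt"]:
    print(repr(text), pattern.findall(text))
# 'mutt' ['mutt']
# '@mutt' []
# '#mutt' ['mutt']
# '_mutt' ['_mutt']

# Hypothetical sentence, to show that only the @-prefixed word is skipped.
print(pattern.findall("hello @mutt and #tag"))
# ['hello', 'and', 'tag']
```

Because there is no `^` anchor, the pattern can match anywhere in the string, so `re.findall` returns every qualifying word rather than only one at the start.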
In parts:
(?u)   Inline flag for Unicode (or use re.U)
\b     Word boundary
\w     Match a word char
(?<!   Negative lookbehind, assert what is directly on the left is not
@\w    Match @ and a single word char
)      Close lookbehind
\w+    Match 1+ word chars
\b     Word boundary
Upvotes: 0
Reputation: 31
Are you trying to remove the character or exclude the entire word?
import re
patt = re.compile(r'[^@]\w*')
print(patt.search('mutt'))
print(patt.search('#mutt'))
print(patt.search('@mutt'))
print(patt.search('%mutt'))
print(patt.search('^mutt'))
This will give the following output:
<re.Match object; span=(0, 4), match='mutt'>
<re.Match object; span=(0, 5), match='#mutt'>
<re.Match object; span=(1, 5), match='mutt'>
<re.Match object; span=(0, 5), match='%mutt'>
<re.Match object; span=(0, 5), match='^mutt'>
Changing the pattern to:
patt = re.compile(r'[^@]?\w*')
will provide this output:
<re.Match object; span=(0, 4), match='mutt'>
<re.Match object; span=(0, 5), match='#mutt'>
<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(0, 5), match='%mutt'>
<re.Match object; span=(0, 5), match='^mutt'>
Upvotes: 0
Reputation: 1055
I believe you are looking for the following instead:
token_pattern = '(?u)\\b(?<!@)\\w\\w+\\b'
That said, please do me a favour and write it as a raw string, so the backslashes do not need doubling:
token_pattern = r'(?u)\b(?<!@)\w\w+\b'
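A short sketch of how this could be checked: the two literals above are the same string, and the lookbehind version skips only words directly preceded by @. The names `escaped`, `raw`, `token_re`, and the sample sentence are made up for illustration.

```python
import re

# The escaped literal and the raw-string literal describe the same pattern.
escaped = '(?u)\\b(?<!@)\\w\\w+\\b'
raw = r'(?u)\b(?<!@)\w\w+\b'
print(escaped == raw)
# True

token_re = re.compile(raw)

# Inputs from the question.
print(token_re.search("@mutt"))           # None
print(token_re.search("#mutt").group())   # mutt
print(token_re.search("_mutt").group())   # _mutt

# Hypothetical sentence: only the word right after @ is dropped.
sentence = "ping @mutt about #regex and _private names"
print(token_re.findall(sentence))
# ['ping', 'about', 'regex', 'and', '_private', 'names']
```

The difference from the pattern in the question is that `(?<!@)` looks at the character before the word instead of anchoring the whole match to the start of the string with `^`.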
Upvotes: 1