Reputation: 1906

select words excluding some specific ones - regular expression

I am trying to match the words in a sentence, excluding the ones that start with @.

The original pattern, which does not exclude words starting with @, is the following:

>>> import re
>>> token_pattern_o='(?u)\\b\\w\\w+\\b'
>>> re.search(token_pattern_o, "@mutt")
<re.Match object; span=(1, 5), match='mutt'>

Now I am just adding a negative lookahead for the exclusion:

>>> token_pattern = '(?u)\\b^(?!@)\\w\\w+\\b'
>>> re.search(token_pattern, "#mutt")
>>> re.search(token_pattern, "@mutt")
>>> re.search(token_pattern, "mutt")
<re.Match object; span=(0, 4), match='mutt'>
>>> re.search(token_pattern, "_mutt")
<re.Match object; span=(0, 5), match='_mutt'>

The issue is that it excludes every word starting with any special character, not just @.

Is there a way to achieve what I am trying to achieve?

Upvotes: 0

Answers (3)

The fourth bird

Reputation: 163267

Another option is to match a single word character, and then assert that it is not preceded by an @.

If that is the case, match 1+ word chars and use word boundaries at the beginning and the end of the pattern.

(?u)\b\w(?<!@\w)\w+\b

In parts

(?u) Inline flag for unicode (or use re.U)
\b Word boundary
\w Match a word char
(?<! Negative lookbehind, assert what is directly on the left is not
- @\w Match @ and a single word char
) Close lookbehind
\w+ Match 1+ word chars
\b Word boundary
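
As a quick check of the breakdown above, here is a minimal sketch that runs the pattern over a whole sentence with re.findall. The sample sentence is hypothetical and chosen only for illustration; it is not from the question.

```python
import re

# Pattern from this answer, written as a raw string.
token_pattern = r'(?u)\b\w(?<!@\w)\w+\b'

# Hypothetical sample sentence, chosen only for illustration.
sentence = "hello @mutt #tag _name plain"

# Collect every token the pattern accepts.
tokens = re.findall(token_pattern, sentence)
print(tokens)  # ['hello', 'tag', '_name', 'plain']
```

Note that @mutt is skipped entirely, while #tag still yields tag, since only a leading @ is rejected by the lookbehind.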

Upvotes: 0

John Dahl

Reputation: 31

Are you trying to remove the character or exclude the entire word?

import re

patt = re.compile(r'[^@]\w*')

print(patt.search('mutt'))
print(patt.search('#mutt'))
print(patt.search('@mutt'))
print(patt.search('%mutt'))
print(patt.search('^mutt'))

will give this output:

<re.Match object; span=(0, 4), match='mutt'>
<re.Match object; span=(0, 5), match='#mutt'>
<re.Match object; span=(1, 5), match='mutt'>
<re.Match object; span=(0, 5), match='%mutt'>
<re.Match object; span=(0, 5), match='^mutt'>

Changing the pattern to:

patt = re.compile(r'[^@]?\w*')

will provide this output:

<re.Match object; span=(0, 4), match='mutt'>
<re.Match object; span=(0, 5), match='#mutt'>
<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(0, 5), match='%mutt'>
<re.Match object; span=(0, 5), match='^mutt'>

Upvotes: 0

Jordan Brière

Reputation: 1055

I believe you are looking for the following instead:

token_pattern = '(?u)\\b(?<!@)\\w\\w+\\b'

That said, please do me a favour and use a raw string, so the backslashes do not need to be doubled:

token_pattern = r'(?u)\b(?<!@)\w\w+\b'
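
A small sketch to confirm the two spellings are the same pattern and to try it on the inputs from the question. The helper name first_token is made up for this example.

```python
import re

escaped = '(?u)\\b(?<!@)\\w\\w+\\b'   # doubled backslashes
raw = r'(?u)\b(?<!@)\w\w+\b'          # raw string, same characters
same = escaped == raw

def first_token(text):
    """Return the first match of the pattern in text, or None (hypothetical helper)."""
    m = re.search(raw, text)
    return m.group(0) if m else None

# Inputs taken from the question.
results = [first_token(s) for s in ["#mutt", "@mutt", "mutt", "_mutt"]]
print(same, results)  # True ['mutt', None, 'mutt', '_mutt']
```

Only the word preceded by @ is rejected; #mutt still yields mutt because the lookbehind checks for @ specifically.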

Upvotes: 1
