Mrinal

Reputation: 1906

select words excluding some specific ones - regular expression

I am trying to match words in a sentence, excluding the ones that start with @.

The original pattern which does not exclude the words starting with @ is the following:

>>> import re
>>> token_pattern_o='(?u)\\b\\w\\w+\\b'
>>> re.search(token_pattern_o, "@mutt")
<re.Match object; span=(1, 5), match='mutt'>

Now I am just adding a negative lookahead for the exclusion:

>>> token_pattern = '(?u)\\b^(?!@)\\w\\w+\\b'
>>> re.search(token_pattern, "#mutt")
>>> re.search(token_pattern, "@mutt")
>>> re.search(token_pattern, "mutt")
<re.Match object; span=(0, 4), match='mutt'>
>>> re.search(token_pattern, "_mutt")
<re.Match object; span=(0, 5), match='_mutt'>

The issue is that it now excludes words starting with any special character, not just @.

Is there a way to achieve what I am trying to achieve?

Upvotes: 0

Views: 91

Answers (3)

The fourth bird

Reputation: 163267

Another option is to match a single word character and assert that it is not preceded by an @.

If the assertion holds, match 1+ more word chars. Word boundaries at the beginning and the end of the pattern keep the match to a whole word.

(?u)\b\w(?<!@\w)\w+\b

In parts

  • (?u) Inline flag for unicode (or use re.U)
  • \b Word boundary
  • \w Match a word char
  • (?<! Negative lookbehind, assert what is directly on the left is not
    • @\w Match @ and a single word char
  • ) Close lookbehind
  • \w+ Match 1+ word chars
  • \b Word boundary


Upvotes: 0

John Dahl

Reputation: 31

Are you trying to remove the character or exclude the entire word?

import re

patt = re.compile(r'[^@]\w*')

print(patt.search('mutt'))
print(patt.search('#mutt'))
print(patt.search('@mutt'))
print(patt.search('%mutt'))
print(patt.search('^mutt'))

will give this output:

<re.Match object; span=(0, 4), match='mutt'>
<re.Match object; span=(0, 5), match='#mutt'>
<re.Match object; span=(1, 5), match='mutt'>
<re.Match object; span=(0, 5), match='%mutt'>
<re.Match object; span=(0, 5), match='^mutt'>

Changing the pattern to:

patt = re.compile(r'[^@]?\w*')

will provide this output:

<re.Match object; span=(0, 4), match='mutt'>
<re.Match object; span=(0, 5), match='#mutt'>
<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(0, 5), match='%mutt'>
<re.Match object; span=(0, 5), match='^mutt'>

Upvotes: 0

Jordan Brière

Reputation: 1055

I believe you are looking for the following instead:

token_pattern = '(?u)\\b(?<!@)\\w\\w+\\b'

That said, please do me a favour and use a raw string, so the backslashes do not have to be doubled:

token_pattern = r'(?u)\b(?<!@)\w\w+\b'
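
A quick sketch to confirm this against the inputs from the question. The results dict and the loop are only illustrative scaffolding, not part of the original answer.

```python
import re

# The escaped and the raw spelling describe the same pattern.
escaped = '(?u)\\b(?<!@)\\w\\w+\\b'
token_pattern = r'(?u)\b(?<!@)\w\w+\b'
assert escaped == token_pattern

# Inputs taken from the question; results maps each input to the
# matched text, or None when there is no match.
results = {}
for text in ("#mutt", "@mutt", "mutt", "_mutt"):
    m = re.search(token_pattern, text)
    results[text] = m.group() if m else None
    print(repr(text), '->', results[text])
```

Only "@mutt" yields no match. "#mutt" matches 'mutt', since the lookbehind rejects a word only when the character right before it is @.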

Upvotes: 1
