HKN

Reputation: 228

Pulling out valid twitter names using re module in Python

1. Background info

I have a string that contains both valid and invalid Twitter usernames, like this:

@moondra2017.org,@moondra,Python@moondra,@moondra_python

In the above string, @moondra and @moondra_python are valid usernames. The rest are not.

1.1 Goal

Using \b and/or \B as part of the regex pattern, I need to extract the valid usernames.

P.S. I must use \b and/or \B in the regex; that is part of the goal.

2. My Failed Attempt

import re

# (in)valid twitter user names
un1 = '@moondra2017.org' # invalid
un2 = '@moondra'        # << valid, we want this
un3 = 'Python@moondra'   # invalid
un4 = '@moondra_python' # << valid, we want this

string23 = f'{un1},{un2},{un3},{un4}'

pattern = re.compile(r'(?:\B@\w+\b(?:[,])|\B@\w+\b)')  # ??
print('10:', re.findall(pattern, string23))  # line 10

2.1 Observed: The above code prints:

10: ['@moondra2017', '@moondra,', '@moondra_python'] # incorrect

2.2 Expected:

10: ['@moondra', '@moondra_python'] # correct

Upvotes: 1

Views: 46

Answers (1)

Wiktor Stribiżew

Reputation: 627607

I will answer assuming that the mentions are always comma-separated, in the format shown above.

Then, to match the end of a mention, you need a comma boundary: either (?![^,]) or the less efficient but online-tester-friendly (?=,|$).

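# Option 1: negative lookahead, the next char must not be anything other than a comma
pattern = re.compile(r'\B@\w+\b(?![^,])')
# Option 2: positive lookahead, a comma or the end of the string must follow
pattern = re.compile(r'\B@\w+\b(?=,|$)')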

See the regex demo and the Python demo

Details

  • \B - a non-word boundary, there must be start of string or a non-word char immediately to the left of the current location
  • @ - a @ char
  • \w+ - 1+ word chars (letters, digits or _)
  • \b - a word boundary (the next char should be a non-word char or end of string)
  • (?![^,]) - the next char cannot be a char different from , (so it should be , or end of string).
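
As a quick check, here is a minimal sketch that runs both patterns against the string from the question. The variable names (`text`, `no_boundary`, `neg_lookahead`, `pos_lookahead`) are my own and only for illustration. The first pattern has no comma boundary, to show why `\b` alone still lets `@moondra2017` through.

```python
import re

# The sample string from the question (comma-separated mentions).
text = '@moondra2017.org,@moondra,Python@moondra,@moondra_python'

# Without a comma boundary: \b is satisfied between "7" and ".",
# so the invalid "@moondra2017" is still matched.
no_boundary = re.compile(r'\B@\w+\b')

# With a comma boundary: the mention must be followed by "," or end of string.
neg_lookahead = re.compile(r'\B@\w+\b(?![^,])')
pos_lookahead = re.compile(r'\B@\w+\b(?=,|$)')

print(no_boundary.findall(text))    # ['@moondra2017', '@moondra', '@moondra_python']
print(neg_lookahead.findall(text))  # ['@moondra', '@moondra_python']
print(pos_lookahead.findall(text))  # ['@moondra', '@moondra_python']
```

In all three patterns, `Python@moondra` is rejected by the leading `\B`, because the `@` there is preceded by a word character.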

Upvotes: 2
