azawalich

Reputation: 135

Regex to match words both starting and ending with underscore with Python 3

I have the following sample code, where I am trying to match all words that both start and end with an underscore (either a single or a double one).

import re
test = ['abc text_ abc',
'abc _text abc',
'abc text_textUnderscored abc',
'abc :_text abc', 
'abc _text_ abc', 
'abc __text__ abc',
'abc _text_: abc',
'abc (-_-) abc']
test_str = ' '.join(test)
print(re.compile('(_\\w+\\b)').split(test_str))

I have already tried the following regex, but it matches too much (it should match only _text_ and __text__).

Output: ['abc text_ abc abc ', '_text', ' abc abc text', '_textUnderscored', ' abc abc :', '_text', ' abc abc ', '_text_', ' abc abc ', '__text__', ' abc abc ', '_text_', ': abc abc (-_-) abc']

Can you suggest a better approach (preferably with single regex pattern and usage of re.split method)?

Upvotes: 5

Views: 2499

Answers (2)

Wiktor Stribiżew

Reputation: 626870

If you mean to match any chunks of word chars (letters, digits and underscores) that start and end with an underscore, are neither preceded nor followed by other word chars, and can be of any length (even a single _), you may use

r'\b_(?:\w*_)?\b'

with re.findall. See the regex demo.

If you do not want to match single-char words (i.e. _) you need to remove the optional non-capturing group, and use r'\b_\w*_\b'.

If you need to match words of at least 3 chars, also replace * (zero or more repetitions) with + (one or more repetitions).
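To make the difference between the three word-boundary variants concrete, here is a small sketch. The sample string and the dictionary labels are made up for illustration and do not come from the question.

```python
import re

# Hypothetical sample: a lone "_", a two-char "__", and longer words.
sample = "a _ b __ c _x_ d __text__ e"

patterns = {
    "optional group": r'\b_(?:\w*_)?\b',  # also matches a lone "_"
    "star": r'\b_\w*_\b',                 # needs at least 2 chars
    "plus": r'\b_\w+_\b',                 # needs at least 3 chars
}

results = {name: re.findall(p, sample) for name, p in patterns.items()}
for name, found in results.items():
    print(name, found)
# optional group ['_', '__', '_x_', '__text__']
# star ['__', '_x_', '__text__']
# plus ['_x_', '__text__']
```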

If you consider words as whole words only when they are at the start/end of the string or are preceded/followed by whitespace, replace \b...\b with (?<!\S)...(?!\S):

r'(?<!\S)_\w*_(?!\S)'

See another regex demo
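As a rough illustration of how the two kinds of boundaries differ, the sketch below runs both patterns on a shortened version of the question's test string (the shortening is mine). The word-boundary pattern still picks up the _text_ in front of the colon, while the whitespace-boundary pattern does not.

```python
import re

# Shortened from the question's test data; includes the "_text_:" and "(-_-)" cases.
test_str = 'abc _text_ abc abc __text__ abc abc _text_: abc abc (-_-) abc'

# Word boundaries: any non-word char (such as ":") counts as a delimiter.
word_bounded = re.findall(r'\b_\w*_\b', test_str)

# Whitespace boundaries: only whitespace or the string edges count.
space_bounded = re.findall(r'(?<!\S)_\w*_(?!\S)', test_str)

print(word_bounded)   # ['_text_', '__text__', '_text_']
print(space_bounded)  # ['_text_', '__text__']
```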

Details

  • \b - a word boundary, there must be start of string or a non-word char right before
  • _ - an underscore
  • (?:\w*_)? - an optional non-capturing group matching 1 or 0 occurrences of
    • \w* - 0+ word chars (letters, digits, _s) (thanks to this optional group, even _ word will be found)
    • _ - an underscore
  • \b - a word boundary, there must be end of string or a non-word char right after
  • (?<!\S) - left whitespace boundary
  • (?!\S) - right whitespace boundary

See the Python demo:

import re

# test_str is the joined string built in the question
rx = re.compile(r'\b_(?:\w*_)?\b')
print(rx.findall(test_str))
# => ['_text_', '__text__', '_text_', '_']
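Since the question asks for re.split specifically, one possible adaptation is to wrap the whitespace-boundary pattern in a capturing group, so that re.split keeps the matched words in its output. This is only a sketch on a shortened test string of my own choosing, and the variable names are illustrative.

```python
import re

# Shortened, hypothetical variant of the question's test string.
test_str = 'abc _text abc abc _text_ abc abc __text__ abc'

# The capturing group makes re.split return the delimiters (the matched words) too.
parts = re.split(r'(?<!\S)(_\w*_)(?!\S)', test_str)
print(parts)
# ['abc _text abc abc ', '_text_', ' abc abc ', '__text__', ' abc']

# With exactly one capturing group, the matches sit at the odd indices.
matches = parts[1::2]
print(matches)  # ['_text_', '__text__']
```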

Upvotes: 3

altskop

Reputation: 331

You don't even need regex here: the most efficient approach for short inputs is to split the string on whitespace and check whether each word starts and ends with an underscore.

def get_underscored(text):
    for word in text.split():
        if word.startswith("_") and word.endswith("_"):
            yield word

test = ['abc text_ abc',
'abc _text abc,',
'abc text_textUnderscored abc',
'abc :_text abc',
'abc _text_ abc',
'abc __text__ abc']
test_str = ' '.join(test)
print(list(get_underscored(test_str)))

Result is ['_text_', '__text__'].

Granted, this approach doesn't scale as well as regex on larger inputs, but it works orders of magnitude faster on smaller ones.
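If you want to check the performance claim on your own machine, a timing sketch along these lines may help. The function names, the sample string and the repeat counts are assumptions made for illustration, and the regex is my whitespace-delimited counterpart of the split-based check, not a pattern from either answer. No timings are quoted here because they depend on the input and the Python build.

```python
import re
import timeit

# Regex counterpart of the split-based check: whitespace-delimited tokens
# that start and end with "_" (a lone "_" qualifies in both versions).
RX = re.compile(r'(?<!\S)_(?:\S*_)?(?!\S)')

def underscored_split(text):
    return [w for w in text.split() if w.startswith("_") and w.endswith("_")]

def underscored_regex(text):
    return RX.findall(text)

sample = 'abc _text_ abc __text__ abc _text abc text_ abc'

for repeat in (1, 1000):
    text = ' '.join([sample] * repeat)
    # Both approaches should agree before their timings are compared.
    assert underscored_split(text) == underscored_regex(text)
    t_split = timeit.timeit(lambda: underscored_split(text), number=20)
    t_regex = timeit.timeit(lambda: underscored_regex(text), number=20)
    print(f'{len(text):>6} chars: split {t_split:.5f}s, regex {t_regex:.5f}s')
```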

Upvotes: 2
