Reputation: 135
I have the following sample code, in which I am trying to match every word that both starts and ends with an underscore (either a single or a double one).
import re
test = ['abc text_ abc',
'abc _text abc',
'abc text_textUnderscored abc',
'abc :_text abc',
'abc _text_ abc',
'abc __text__ abc',
'abc _text_: abc',
'abc (-_-) abc']
test_str = ' '.join(test)
print(re.compile('(_\\w+\\b)').split(test_str))
I have already tried the regex above, but it matches too much (it should match only _text_ and __text__).
Output: ['abc text_ abc abc ', '_text', ' abc abc text', '_textUnderscored', ' abc abc :', '_text', ' abc abc ', '_text_', ' abc abc ', '__text__', ' abc abc ', '_text_', ': abc abc (-_-) abc']
Can you suggest a better approach (preferably with a single regex pattern and the re.split method)?
Upvotes: 5
Views: 2499
Reputation: 626870
If you mean to match chunks of word chars (letters, digits and underscores) of any length (even 1, _) that start and end with an underscore and are neither preceded nor followed by other word chars, you may use r'\b_(?:\w*_)?\b' with re.findall. See the regex demo.
If you do not want to match single-char words (i.e. _), remove the optional non-capturing group and use r'\b_\w*_\b'.
If you need to match words of at least 3 chars, also replace * (zero or more repetitions) with + (one or more repetitions).
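To make the length difference between the three word-boundary variants concrete, here is a minimal sketch. The sample string and the dictionary labels are made up for illustration and are not part of the question's data.

```python
import re

# Hypothetical sample: underscore-wrapped words of 1, 2, 3 and 6 chars,
# plus two words that have an underscore on one side only.
sample = "_ __ _a_ _text_ _text text_"

patterns = {
    "any length": r'\b_(?:\w*_)?\b',  # optional group: a lone "_" is allowed
    "2+ chars": r'\b_\w*_\b',         # two underscores are required
    "3+ chars": r'\b_\w+_\b',         # at least one word char in between
}

results = {label: re.findall(rx, sample) for label, rx in patterns.items()}
for label, found in results.items():
    print(label, found)
# any length ['_', '__', '_a_', '_text_']
# 2+ chars ['__', '_a_', '_text_']
# 3+ chars ['_a_', '_text_']
```

Note that \w also matches underscores, so the "3+ chars" variant would still accept a word such as ___.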
If you consider words as whole words only when they are at the start/end of the string or are preceded/followed by whitespace, replace \b...\b with (?<!\S)...(?!\S):
r'(?<!\S)_\w*_(?!\S)'
Details
\b - a word boundary, there must be start of string or a non-word char right before
_ - an underscore
(?:\w*_)? - an optional non-capturing group matching 1 or 0 occurrences of
    \w* - 0+ word chars (letters, digits, _s) (thanks to this optional group, even a _ word will be found)
    _ - an underscore
\b - a word boundary, there must be end of string or a non-word char right after
(?<!\S) - left whitespace boundary
(?!\S) - right whitespace boundary
See the Python demo:
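import re

# test_str is the joined string built from the eight test strings in the question
rx = re.compile(r'\b_(?:\w*_)?\b')
print(rx.findall(test_str))
# => ['_text_', '__text__', '_text_', '_']
# (the _text_ before ":" and the _ inside (-_-) also sit between word boundaries)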
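Since the question asks for only _text_ and __text__ and for re.split, here is a minimal sketch of the whitespace-boundary variant on the question's data. Wrapping the pattern in a capturing group so that re.split keeps the matches is an addition made for illustration; the names ws_rx, split_rx, found and parts are likewise made up.

```python
import re

test = ['abc text_ abc',
        'abc _text abc',
        'abc text_textUnderscored abc',
        'abc :_text abc',
        'abc _text_ abc',
        'abc __text__ abc',
        'abc _text_: abc',
        'abc (-_-) abc']
test_str = ' '.join(test)

# Whitespace boundaries: the word must be delimited by whitespace
# or by the start/end of the string.
ws_rx = re.compile(r'(?<!\S)_\w*_(?!\S)')
found = ws_rx.findall(test_str)
print(found)
# => ['_text_', '__text__']

# Same pattern with a capturing group, so re.split keeps the matched words.
# The result alternates: text, match, text, match, text.
split_rx = re.compile(r'(?<!\S)(_\w*_)(?!\S)')
parts = split_rx.split(test_str)
print(parts[1::2])
# => ['_text_', '__text__']
```

With a single capturing group, the matched words land at the odd indexes of the split result, and joining all parts restores the original string.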
Upvotes: 3
Reputation: 331
You don't even need a regex here. A simpler approach is to split the string into words and then check whether each word starts and ends with an underscore.
def get_underscored(text):
    for word in text.split():
        if word.startswith("_") and word.endswith("_"):
            yield word
test = ['abc text_ abc',
'abc _text abc,',
'abc text_textUnderscored abc',
'abc :_text abc',
'abc _text_ abc',
'abc __text__ abc']
test_str = ' '.join(test)
print(list(get_underscored(test_str)))
Result is ['_text_', '__text__'].
Granted, this approach may not scale as well as a regex on larger inputs, but it can be considerably faster on smaller ones.
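One way to check the speed trade-off on your own data is a timeit comparison. The sketch below is only an illustration: the list-returning helpers, the regex chosen for the comparison, the sample strings and the repetition counts are all assumptions, and no timings are claimed here. The two approaches are also not strictly equivalent: the split-based check accepts a lone _ and words containing punctuation such as _a-b_, which the regex rejects.

```python
import re
import timeit

# Assumed regex counterpart: whitespace-delimited words wrapped in underscores.
RX = re.compile(r'(?<!\S)_\w*_(?!\S)')

def get_underscored(text):
    # Split-based approach, returning a list instead of a generator.
    return [w for w in text.split() if w.startswith("_") and w.endswith("_")]

def get_underscored_re(text):
    # Regex-based approach.
    return RX.findall(text)

# Hypothetical inputs: one short string and the same string repeated many times.
small = 'abc _text_ abc __text__ abc text_ abc _text'
large = ' '.join([small] * 2_000)

for name, text, number in (("small", small, 10_000), ("large", large, 10)):
    t_split = timeit.timeit(lambda: get_underscored(text), number=number)
    t_regex = timeit.timeit(lambda: get_underscored_re(text), number=number)
    print(f"{name}: split={t_split:.4f}s regex={t_regex:.4f}s")
```

Results depend on the Python version and the input, so it is worth running this on representative data before choosing one approach.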
Upvotes: 2