Reputation: 453
Given this text "hey a2a 3beauty hou\se heyYou2", I would like to keep only the words that start with a letter and continue with a-z, A-Z, or digits. So this would be my desired output: " hey a2a heyYou2".
My solution so far goes through text.split():
import re
text = r"hey a2a 3beauty hou\se heyYou2"
text = text.split()
text = [w for w in text if re.search(r"^[a-zA-Z][a-zA-Z0-9]*$", w) is not None]
' '.join(text)
Out[55]: 'hey a2a heyYou2'
Is there a faster, more efficient way to achieve this using regex, without splitting the text into a list of words?
Upvotes: 1
Views: 557
Reputation: 626927
You may use a single re.sub call with the following regex:
\s*(?<!\S)(?![a-zA-Z][a-zA-Z0-9]*(?!\S))\S+
See the regex demo
Details
\s* - 0+ whitespaces
(?<!\S) - a leading whitespace boundary
(?![a-zA-Z][a-zA-Z0-9]*(?!\S)) - a negative lookahead that fails the match if, immediately to the right of the current location, there are
    [a-zA-Z] - a letter
    [a-zA-Z0-9]* - 0 or more alphanumeric chars
    (?!\S) - a trailing whitespace boundary
\S+ - one or more non-whitespace chars
import re
text = r"hey a2a 3beauty hou\se heyYou2"
print(re.sub(r"\s*(?<!\S)(?![a-zA-Z][a-zA-Z0-9]*(?!\S))\S+", "", text))
# => hey a2a heyYou2
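If the substitution is applied to many strings, it may help to compile the pattern once and wrap it in a function. The following is only a sketch: the names INVALID_WORD, keep_valid_words and keep_valid_words_split are made up for illustration, and the split-based variant is a condensed version of the approach from the question, included just to compare results. It also shows one edge case: when the very first token is rejected, the space that followed it is left in place.

```python
import re

# Hypothetical name: matches optional leading whitespace plus one token
# that is NOT "a letter followed by letters/digits".
INVALID_WORD = re.compile(r"\s*(?<!\S)(?![a-zA-Z][a-zA-Z0-9]*(?!\S))\S+")


def keep_valid_words(text):
    """Drop every whitespace-delimited token that is not letter + alphanumerics."""
    return INVALID_WORD.sub("", text)


def keep_valid_words_split(text):
    """Split-based equivalent of the question's approach, for comparison."""
    return " ".join(
        w for w in text.split() if re.fullmatch(r"[a-zA-Z][a-zA-Z0-9]*", w)
    )


sample = r"hey a2a 3beauty hou\se heyYou2"
print(keep_valid_words(sample))        # hey a2a heyYou2
print(keep_valid_words_split(sample))  # hey a2a heyYou2

# A rejected first token leaves the following space behind.
print(repr(keep_valid_words("3beauty hey")))  # ' hey'
```

If the leading space is unwanted, calling .strip() on the result removes it. Which variant is faster depends on the input, so it is worth measuring with timeit on representative data rather than assuming the single-regex version wins.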
Upvotes: 3