Forinstance

Reputation: 453

Python regex: keep only words that start with a letter and continue with [a-zA-Z0-9]

Given the text "hey a2a 3beauty hou\se heyYou2", I would like to keep only the words that start with a letter and continue with a-z, A-Z, or digits. So this would be my desired output: "hey a2a heyYou2".

My solution so far relies on text.split():

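import re

text = r"hey a2a 3beauty hou\se heyYou2"  # raw string keeps the backslash literal
text = text.split()
# keep only words made of a letter followed by letters or digits
text = [w for w in text if re.search(r"^[a-zA-Z][a-zA-Z0-9]*$", w) is not None]
' '.join(text)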

Out[55]: 'hey a2a heyYou2'

Is there a faster, more efficient way to achieve this using regex, without splitting the text into a list of words?

Upvotes: 1

Views: 557

Answers (1)

Wiktor Stribiżew

Reputation: 626927

You may use a single re.sub call with the following regex:

\s*(?<!\S)(?![a-zA-Z][a-zA-Z0-9]*(?!\S))\S+

Details

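  • \s* - zero or more whitespace characters
  • (?<!\S) - a left-hand whitespace boundary: the current position must be at the start of the string or preceded by whitespace
  • (?![a-zA-Z][a-zA-Z0-9]*(?!\S)) - a negative lookahead that fails the match if, immediately to the right of the current location, there is a valid word, that is:
    • [a-zA-Z] - a letter
    • [a-zA-Z0-9]* - zero or more letters or digits
    • (?!\S) - a right-hand whitespace boundary: the next position must be the end of the string or a whitespace character
  • \S+ - one or more non-whitespace characters (the invalid word that gets removed)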
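
The breakdown above can also be written into the pattern itself. This is only a sketch of the same regex using re.VERBOSE, which lets each part carry its own comment; the name INVALID_WORD is illustrative and not part of the original answer.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE.
# INVALID_WORD is a made-up name used for illustration only.
INVALID_WORD = re.compile(
    r"""
    \s*                 # zero or more whitespace chars before the word
    (?<!\S)             # left boundary: start of string or whitespace before
    (?!                 # fail here if a valid word follows:
        [a-zA-Z]        #   a letter
        [a-zA-Z0-9]*    #   then zero or more letters or digits
        (?!\S)          #   right boundary: end of string or whitespace next
    )
    \S+                 # otherwise consume the whole invalid word
    """,
    re.VERBOSE,
)

text = r"hey a2a 3beauty hou\se heyYou2"
print(INVALID_WORD.sub("", text))  # hey a2a heyYou2
```

Note that when the very first word is invalid, the whitespace after it is not consumed, so the result may start with a space (for example "3x ok" becomes " ok"); a trailing .strip() handles that if it matters.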

Python code demo:

import re
text = r"hey a2a 3beauty hou\se heyYou2"  # raw string: "\s" is not a valid escape in a normal string literal
print(re.sub(r"\s*(?<!\S)(?![a-zA-Z][a-zA-Z0-9]*(?!\S))\S+", "", text))
# => hey a2a heyYou2
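
As a possible alternative (not from the answer above), the valid words can be matched directly instead of deleting the invalid ones. The sketch below reuses the same whitespace boundaries with re.findall; the names VALID_WORD and keep_valid_words are hypothetical. It still builds a list internally, and unlike the re.sub version it collapses the whitespace between kept words to single spaces.

```python
import re

# Hypothetical names; this matches valid words rather than removing invalid ones.
VALID_WORD = re.compile(r"(?<!\S)[a-zA-Z][a-zA-Z0-9]*(?!\S)")

def keep_valid_words(text):
    """Return the valid words of `text`, joined by single spaces."""
    return " ".join(VALID_WORD.findall(text))

print(keep_valid_words(r"hey a2a 3beauty hou\se heyYou2"))  # hey a2a heyYou2
```

Because the words are re-joined, this variant never leaves a leading space when the first word is dropped.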

Upvotes: 3
