Jabb

Reputation: 3502

Exclude matching the period character in a [\W\d]+ regex

I'd like to remove everything from a string except alphabetic characters and periods. I wrote the function below in Python. How would I extend the regex so that periods are NOT stripped from the string? This needs to work for Unicode strings.

def normalize(self, text):
    text = re.sub(ur"(?u)[\W\d]+", ' ', text)
    print text
    return text

Upvotes: 1

Views: 2537

Answers (1)

yurib

Reputation: 8147

Change the semantics from 'strip everything in this group' to 'strip everything that's not in this group', and use:

text = re.sub(ur"(?u)[^a-zA-Z\.]+", ' ', text)
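To see what this character class keeps and drops, here is a small sketch. It assumes Python 3 (where the ur prefix no longer exists and str is already Unicode), and the helper name strip_non_ascii_letters is made up for illustration:

```python
import re

def strip_non_ascii_letters(text):
    # Hypothetical helper: replace each run of characters that are
    # neither ASCII letters nor periods with a single space.
    return re.sub(r"[^a-zA-Z.]+", " ", text)

# The accented letter is treated like any other non-ASCII-letter character.
print(strip_non_ascii_letters("abc.\u00e01!def"))  # abc. def
```

Note that the period does not need a backslash inside a character class, although escaping it is harmless.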

Update

I don't think the solution above will work for every Unicode alphabet, because a-zA-Z only covers ASCII letters.
The answers here offer alternative modules to the built-in re that support Unicode letter groups.

Another option is to combine the two approaches:

>>> text = u'1234abcd.à!@#$'
>>> re.sub(ur'(?u)([^\w\.]|\d)+',' ',text)
u' abcd.\xe0 '
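For reference, a minimal sketch of the same combined approach in Python 3, where str patterns are Unicode-aware by default, so neither the ur prefix nor the (?u) flag is needed. The function name normalize follows the question. Stripping the underscore as well is an assumption on my part, since \w also matches _ and the question asks for alphabetic characters only:

```python
import re

# Runs of anything that is not a word character or a period,
# plus digits and underscores (which \w would otherwise let through).
_NON_LETTERS = re.compile(r"(?:[^\w.]|[\d_])+")

def normalize(text):
    # Replace each unwanted run with a single space.
    return _NON_LETTERS.sub(" ", text)

print(repr(normalize("1234abcd.\u00e0!@#$")))  # ' abcd.à '
print(repr(normalize("foo_bar 42.baz")))       # 'foo bar .baz'
```

One caveat: an accented letter stored as a base letter plus a combining mark will lose the mark, because combining marks are not matched by \w. Running unicodedata.normalize("NFC", text) first should avoid that for characters that have a precomposed form.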

Upvotes: 5
