Reputation: 3502
I'd like to remove everything from a string except alphabetic characters and periods. I wrote the function below in Python. How would I extend the regex so that periods are NOT stripped from the string? This needs to work for Unicode strings.
import re

def normalize(self, text):
    # Replace every run of non-word characters and digits with a single space.
    text = re.sub(ur"(?u)[\W\d]+", ' ', text)
    print text
    return text
Upvotes: 1
Views: 2537
Reputation: 8147
Change the semantics from 'strip everything in this group' to 'strip everything that's not in this group' and use:
text = re.sub(ur"(?u)[^a-zA-Z\.]+", ' ', text)
Update:
I don't think the solution above will work with every Unicode alphabet, since a-zA-Z covers only ASCII letters.
The answers here offer alternative modules to the built-in re that support Unicode letter groups.
Another option is combining the two approaches:
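To make the limitation concrete, here is a minimal sketch of the ASCII-only pattern. It assumes Python 3, where `str` is already Unicode and the `ur` prefix no longer exists; the function name `normalize_ascii` is made up for illustration.

```python
import re

def normalize_ascii(text):
    # Replace every run of characters that are neither ASCII letters
    # nor periods with a single space. Accented letters such as "à"
    # fall outside a-zA-Z, so they are stripped as well.
    return re.sub(r"[^a-zA-Z.]+", " ", text)

print(normalize_ascii("1234abcd.à!@#$"))
```

The accented letter is removed together with the digits and punctuation, which is why this pattern is not enough for Unicode text.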
>>> text = u'1234abcd.à!@#$'
>>> re.sub(ur'(?u)([^\w\.]|\d)+', ' ', text)
u' abcd.\xe0 '
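For reference, a sketch of the same combined idea under Python 3 (an assumption, since the snippets above are Python 2). String patterns there match Unicode by default, so the `(?u)` flag is not needed. Adding `_` to the second class is my addition: `\w` also matches the underscore, which is not an alphabetic character.

```python
import re

def normalize(text):
    # [^\w.] matches anything that is not a word character or a period.
    # [\d_] additionally matches digits and the underscore, which \w keeps.
    # Each run of such characters collapses into a single space.
    return re.sub(r"(?:[^\w.]|[\d_])+", " ", text)

print(normalize("1234abcd.à!@#$"))
```

Here the accented letter survives, because `\w` covers letters from any script when the pattern is a `str`.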
Upvotes: 5