Reputation: 4523
I need to clean up some strings by removing accent marks, and I do this using the following code:
import unicodedata
text = ''.join(c for c in unicodedata.normalize('NFKD', text)
               if unicodedata.category(c) != 'Mn')
The problem with this piece of code is that it also strips the marks from 'ñ' and 'ç'. I need to preserve these characters for my program to work, so how can I do this?
My first idea was to replace those characters in the original string with something else and, after normalising, replace them back, but I ended up with ugly and time-consuming code. Any better ideas?
Upvotes: 0
Views: 984
Reputation: 365707
As far as Unicode is concerned, the tilde on an ñ and the cedilla on a ç are accent marks, just like the acute on an é.
In particular, the decomposed tilde and cedilla are in the same general category (Mn, nonspacing marks) as the decomposed acute. So, if you use that category to decide what to remove, they'll get removed.
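If you want to see this for yourself, here is a minimal sketch (assuming Python 3; the helper name decomposed_categories is just something made up for illustration) that lists the name and category of each code point in the NFKD form of a character:

```python
import unicodedata

def decomposed_categories(ch):
    """Return (name, category) pairs for each code point in the NFKD form of ch."""
    return [(unicodedata.name(c), unicodedata.category(c))
            for c in unicodedata.normalize('NFKD', ch)]

# U+00F1 is ñ, U+00E7 is ç, U+00E9 is é
for ch in '\u00f1\u00e7\u00e9':
    print(ch, decomposed_categories(ch))
# ñ [('LATIN SMALL LETTER N', 'Ll'), ('COMBINING TILDE', 'Mn')]
# ç [('LATIN SMALL LETTER C', 'Ll'), ('COMBINING CEDILLA', 'Mn')]
# é [('LATIN SMALL LETTER E', 'Ll'), ('COMBINING ACUTE ACCENT', 'Mn')]
```

All three marks come out as 'Mn', which is why a filter on that category alone cannot tell them apart.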
The obvious solution is to code the exceptions manually: drop all characters in the class except the ones you want to keep. Like:
good_accents = {
    u'\N{COMBINING TILDE}',
    u'\N{COMBINING CEDILLA}',
}
# ...
text = ''.join(c for c in unicodedata.normalize('NFKD', text)
               if (unicodedata.category(c) != 'Mn'
                   or c in good_accents))
If you're on an old enough Python that it doesn't do \N escapes (or if I got them wrong and you can't look them up for yourself), you can always use u'\u0303', u'\u0327' with a comment instead.
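Two things to keep in mind with the filter above: it keeps every combining tilde and cedilla, so an ã would keep its tilde too, and the result stays in decomposed form ('n' followed by U+0303 rather than the single code point ñ). If that matters, one possible variation is to decide per character instead of per mark. This is only a sketch, assuming Python 3 and assuming that ñ, Ñ, ç and Ç are the only letters you want to protect; the names KEEP and strip_accents are hypothetical:

```python
import unicodedata

# Assumption: only these precomposed letters should survive untouched.
KEEP = frozenset('\u00f1\u00d1\u00e7\u00c7')  # ñ Ñ ç Ç

def strip_accents(text, keep=KEEP):
    """Remove nonspacing marks, except on the letters listed in keep."""
    # Compose first so ñ/ç are single code points even if the input was decomposed.
    text = unicodedata.normalize('NFC', text)
    out = []
    for ch in text:
        if ch in keep:
            # Protected letter: copy it through unchanged.
            out.append(ch)
        else:
            # Everything else: decompose and drop the nonspacing marks.
            out.extend(c for c in unicodedata.normalize('NFKD', ch)
                       if unicodedata.category(c) != 'Mn')
    return ''.join(out)

print(strip_accents('ma\u00f1ana caf\u00e9 gar\u00e7on S\u00e3o'))
# mañana cafe garçon Sao
```

The trade-off is that you have to list each protected letter (including uppercase) explicitly, but in exchange a tilde on any other letter still gets removed.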
Upvotes: 2