David Moreno García

Reputation: 4523

Replace accent marks preserving special characters

I need to clean up some strings by removing accent marks, and I do this with the following code:

import unicodedata

text = ''.join(c for c in unicodedata.normalize('NFKD', text)
               if unicodedata.category(c) != 'Mn')

The problem with this code is that it also strips the marks from 'ñ' and 'ç', turning them into 'n' and 'c'. I need to preserve these characters for my program to work, so how can I do this?

My first idea was to replace those characters in the original string with something else and, after normalising, replace them back, but I ended up with ugly and time-consuming code. Any better ideas?

Upvotes: 0

Views: 984

Answers (1)

abarnert

Reputation: 365707

As far as Unicode is concerned, the tilde on an ñ and the cedilla on a ç are accent marks, just like the acute on an é.

In particular, the decomposed tilde and cedilla are in the same character class (general category Mn, nonspacing marks) as the decomposed acute. So, if you use that character class to decide what to remove, they'll get removed.
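You can check this for yourself. Here is a quick sketch (Python 3; the helper name decomposed_categories is just something I made up for illustration) that decomposes each letter and reports the name and category of every resulting code point:

```python
import unicodedata

def decomposed_categories(ch):
    """Return (name, general category) for each code point in the NFKD form of ch."""
    return [(unicodedata.name(c), unicodedata.category(c))
            for c in unicodedata.normalize('NFKD', ch)]

for ch in ('ñ', 'ç', 'é'):
    # Each letter splits into a base letter (category Ll) plus one combining mark.
    print(ch, decomposed_categories(ch))
```

In all three cases the second code point comes back with category 'Mn', which is why a filter on 'Mn' cannot tell them apart.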

The obvious solution is to code the exceptions manually: drop all characters in the class except the ones you want to keep. For example:

good_accents = {
    u'\N{COMBINING TILDE}',
    u'\N{COMBINING CEDILLA}'
}

# ...

text = ''.join(c for c in unicodedata.normalize('NFKD', text)
                       if (unicodedata.category(c) != 'Mn'
                           or c in good_accents))

If you're on an old enough Python that it doesn't do \N escapes (or if I got them wrong and you can't look them up for yourself), you can always use u'\u0303' (combining tilde) and u'\u0327' (combining cedilla) with a comment instead.
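Two things to be aware of with the exception-set approach: the result stays in decomposed form (n followed by a combining tilde, not the single code point ñ), and the tilde is kept on every letter, so ã and õ survive as well. If that matters to you, here is a possible variation, only a sketch for Python 3. The function name strip_accents, the KEEP set, the restriction to the base letters n and c, and the final NFC step are all my assumptions about what you want, not something taken from your code:

```python
import unicodedata

# (base letter, combining mark) pairs to preserve. Restricting the marks to
# specific base letters is an assumption about the desired behaviour.
KEEP = {
    ('n', '\u0303'),  # n + COMBINING TILDE   -> ñ
    ('c', '\u0327'),  # c + COMBINING CEDILLA -> ç
}

def strip_accents(text):
    out = []
    base = ''  # most recent character that is not a nonspacing mark
    for c in unicodedata.normalize('NFKD', text):
        if unicodedata.category(c) != 'Mn':
            base = c
            out.append(c)
        elif (base.lower(), c) in KEEP:
            out.append(c)  # a mark we want, on a letter we want it on
    # Recompose so the kept pairs become single code points again.
    return unicodedata.normalize('NFC', ''.join(out))

print(strip_accents('año façade café São'))
```

Note that NFKD also applies compatibility decompositions (ligatures, for instance), exactly as in your original code; the NFC step at the end does not undo those.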

Upvotes: 2
