Stripping a unicode text of whatever is not a character

Question

I'm trying to write a simple Python script that takes a text file as input, deletes every non-letter character, and writes the result to another file. Normally I would do this in one of two ways:

use a regular expression with re.sub to replace every non-letter character with an empty string
examine every character in every line and write it to the output only if it is in string.lowercase
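
For illustration, here is a rough sketch of how both ASCII-only approaches behave on accented text. It assumes Python 3, where string.lowercase is spelled string.ascii_lowercase, and the sample word is made up for the example:

```python
import re
import string

sample = "perch\u00e9"  # hypothetical sample: "perché" with a precomposed é

# Approach 1: a regex that keeps only ASCII letters.
regex_result = re.sub(r"[^a-zA-Z]", "", sample)

# Approach 2: keep a character only if it is an ASCII lowercase letter.
loop_result = "".join(ch for ch in sample if ch in string.ascii_lowercase)

# Both approaches silently drop the accented letter.
print(regex_result)
print(loop_result)
```

Neither approach knows that é is a letter, which is why a Unicode-aware test is needed.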

But this time the text is The Divine Comedy in Italian (I'm Italian), so there are some Unicode characters like

èéï

and some others. I wrote # -*- coding: utf-8 -*- as the first line of the script, but all that achieved is that Python no longer raises errors when Unicode characters appear in the script itself.

Then I tried to include Unicode chars in my regular expression, writing them as, for example:

u'\u00AB'

and that seems to work, but when Python reads input from a file, it doesn't write back what it read in the same form. For example, some characters get converted into a square root symbol.

What should I do?

Mike Samuel · Accepted Answer

unicodedata.category(unichr) returns the Unicode general category of that code point as a two-letter string.

You can find a description of the categories at unicode.org, but the ones relevant to you are the L, N, P, Z and maybe S groups:

Lu    Uppercase_Letter    an uppercase letter
Ll    Lowercase_Letter    a lowercase letter
Lt    Titlecase_Letter    a digraphic character, with first part uppercase
Lm    Modifier_Letter     a modifier letter
Lo    Other_Letter        other letters, including syllables and ideographs
...
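
As a quick sketch of what the lookup returns in practice (Python 3 assumed; the sample characters are chosen for illustration only):

```python
import unicodedata

# Sample characters picked for illustration: letters, a digit,
# punctuation, a space, an opening guillemet, and a newline.
samples = ["a", "Z", "\u00e8", "7", ",", " ", "\u00ab", "\n"]

# Map each character to its two-letter general category.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(repr(ch), cat)
```

Note that the newline falls in the Cc (control) category, so it is not covered by any of the L, N, P, Z or S groups.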

You might also want to normalize your string first so that diacriticals that can attach to letters do so:

unicodedata.normalize(form, unistr)

Return the normal form form for the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.
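
To see why this matters for filtering, here is a small sketch (Python 3 assumed) of an e followed by a separate combining grave accent. Before normalization the accent is its own code point with category Mn, which a letters-only filter would discard:

```python
import unicodedata

decomposed = "e\u0300"  # 'e' followed by COMBINING GRAVE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

# Two code points before normalization, one after.
print(len(decomposed), len(composed))

# The combining mark has its own category until NFC merges it into the letter.
decomposed_categories = [unicodedata.category(ch) for ch in decomposed]
composed_categories = [unicodedata.category(ch) for ch in composed]
print(decomposed_categories)
print(composed_categories)
```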

Putting all this together:

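import unicodedata

file_bytes = ...   # However you read your input
# Decode the raw bytes into a Unicode string before inspecting characters.
file_text = file_bytes.decode('UTF-8')
# Compose base letters with their combining diacriticals where possible.
normalized_text = unicodedata.normalize('NFC', file_text)
allowed_categories = set([
    'Ll', 'Lu', 'Lt', 'Lm', 'Lo',  # Letters
    'Nd', 'Nl',                    # Decimal digits and letter numbers
    'Po', 'Ps', 'Pe', 'Pi', 'Pf',  # Punctuation
    'Zs'                           # Breaking spaces
])
# Keep only the characters whose general category is allowed.
filtered_text = ''.join(
    [ch for ch in normalized_text
     if unicodedata.category(ch) in allowed_categories])
filtered_bytes = filtered_text.encode('UTF-8')  # ready to be written to a file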
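
The same steps could be wrapped into a file-to-file helper along these lines. This is only a sketch: the names keep_allowed and filter_file are made up for illustration, and it assumes the input really is UTF-8 and that io.open is used to handle decoding and encoding:

```python
import io
import unicodedata

ALLOWED_CATEGORIES = frozenset([
    "Ll", "Lu", "Lt", "Lm", "Lo",  # Letters
    "Nd", "Nl",                    # Decimal digits and letter numbers
    "Po", "Ps", "Pe", "Pi", "Pf",  # Punctuation
    "Zs",                          # Breaking spaces
])


def keep_allowed(text, allowed=ALLOWED_CATEGORIES):
    """Hypothetical helper: NFC-normalize text, then drop disallowed characters."""
    normalized = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in normalized
                   if unicodedata.category(ch) in allowed)


def filter_file(src_path, dst_path, encoding="utf-8"):
    """Hypothetical helper: read src_path, filter it, write the result to dst_path."""
    # io.open decodes on read and encodes on write using the given encoding.
    with io.open(src_path, "r", encoding=encoding) as src:
        text = src.read()
    with io.open(dst_path, "w", encoding=encoding) as dst:
        dst.write(keep_allowed(text))
```

With this category set, newlines (category Cc) are removed along with everything else that is not listed, so the output ends up on a single line. If line breaks should survive, they would need to be allowed explicitly.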
