whatyouhide

Reputation: 16781

Stripping a unicode text of whatever is not a character

I'm trying to write a simple Python script that takes a text file as input, deletes every non-literal character, and writes the output to another file. Normally I would have done it in one of two ways:

But this time the text is The Divine Comedy in Italian (I'm Italian), so there are some Unicode characters like

èéï

and some others. I wrote # -*- coding: utf-8 -*- as the first line of the script, but all that did was stop Python from raising errors when Unicode characters appear inside the script itself.

Then I tried to include Unicode chars in my regular expression, writing them as, for example:

u'\u00AB'

and it seems to work, but when Python reads input from a file, it doesn't write the text back out the way it read it. For example, some characters get converted into the square root symbol.

What should I do?

Upvotes: 1

Views: 297

Answers (2)

Mike Samuel

Reputation: 120486

unicodedata.category(unichr) returns the general category of the code point unichr.

You can find a description of the categories at unicode.org but the ones relevant to you are the L, N, P, Z and maybe S groups:

Lu    Uppercase_Letter    an uppercase letter
Ll    Lowercase_Letter    a lowercase letter
Lt    Titlecase_Letter    a digraphic character, with first part uppercase
Lm    Modifier_Letter     a modifier letter
Lo    Other_Letter        other letters, including syllables and ideographs
...
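
As a quick illustration of the lookup described above, here is a small sketch that prints the category of a few characters. The sample characters are my own choice, not taken from the question, and the u'' prefixes are only there so the snippet also runs on Python 2.

```python
import unicodedata

# Sample characters (chosen for illustration only):
# e-grave, capital D, digit 7, left guillemet, space, newline
samples = [u'\u00e8', u'D', u'7', u'\u00ab', u' ', u'\n']

# Look up the two-letter general category of each code point
categories = [unicodedata.category(ch) for ch in samples]
print(categories)  # ['Ll', 'Lu', 'Nd', 'Pi', 'Zs', 'Cc']
```

Note that the newline falls in Cc (control characters), so a filter built only from the L, N, P and Z groups will also remove line breaks.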

You might also want to normalize your string first, so that combining diacritics are merged with the letters they attach to:

unicodedata.normalize(form, unistr)

Return the normal form form for the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.
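
To show why normalization matters for a category-based filter, here is a minimal sketch. It assumes the input may contain a decomposed "è" (a plain "e" followed by a combining grave accent); whether your file actually does depends on how it was produced.

```python
import unicodedata

# 'e' followed by U+0300 COMBINING GRAVE ACCENT (decomposed form)
decomposed = u'e\u0300'

# NFC merges the base letter and the combining mark into one code point
composed = unicodedata.normalize('NFC', decomposed)

print(len(decomposed))  # 2
print(len(composed))    # 1
print([unicodedata.category(ch) for ch in decomposed])  # ['Ll', 'Mn']
print([unicodedata.category(ch) for ch in composed])    # ['Ll']
```

Without the NFC step, the combining accent has category Mn, so a filter that keeps only letters would silently strip the accent and leave a bare "e".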

Putting all this together:

import unicodedata

file_bytes = ...   # However you read your input
file_text = file_bytes.decode('UTF-8')
normalized_text = unicodedata.normalize('NFC', file_text)
allowed_categories = set([
    'Ll', 'Lu', 'Lt', 'Lm', 'Lo',  # Letters
    'Nd', 'Nl',                    # Decimal digits and letter-like numbers
    'Po', 'Ps', 'Pe', 'Pi', 'Pf',  # Punctuation (dashes, Pd, are not included)
    'Zs'                           # Space separators; newlines are Cc and get dropped
])
filtered_text = ''.join(
    [ch for ch in normalized_text
     if unicodedata.category(ch) in allowed_categories])
filtered_bytes = filtered_text.encode('UTF-8')  # ready to be written to a file
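
If you want the file reading and writing spelled out as well, the following is one possible sketch rather than a definitive version. The names keep_letters and filter_file are hypothetical, and it assumes you want to keep only letters, spaces and line breaks, and that both files are UTF-8.

```python
import io
import unicodedata


def keep_letters(text, extra_categories=('Zs',)):
    """Return text with everything except letters, spaces and newlines removed.

    Hypothetical helper: the set of kept categories is an assumption,
    adjust it to whatever you consider a "character".
    """
    text = unicodedata.normalize('NFC', text)
    return u''.join(
        ch for ch in text
        if unicodedata.category(ch).startswith('L')      # any letter category
        or unicodedata.category(ch) in extra_categories  # e.g. plain spaces
        or ch == u'\n')                                   # keep line breaks


def filter_file(src_path, dst_path):
    """Read src_path as UTF-8, filter it, and write dst_path as UTF-8."""
    with io.open(src_path, encoding='utf-8') as src:
        text = src.read()
    with io.open(dst_path, 'w', encoding='utf-8') as dst:
        dst.write(keep_letters(text))


print(repr(keep_letters(u'Nel mezzo, del cammin!\n')))
```

Opening both files with an explicit encoding means the script only ever handles decoded text, and the same encoding is used on the way out as on the way in.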

Upvotes: 2

Harsh Kothari

Reputation: 361

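import codecs

# Open the file so that each line is decoded from UTF-8 while reading
f = codecs.open('FILENAME', encoding='utf-8')
for line in f:
    print(repr(line))  # escaped representation of the decoded line
    print(line)        # the line as it appears in the file
f.close()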

1. The first print gives you the Unicode representation of each line (non-ASCII characters shown as escapes).
2. The second print gives you the line as it is written in your file.

Hopefully this helps :)

Upvotes: 0
