Reputation: 16781
I'm trying to write a simple Python script which takes a text file as an input, deletes every non-literal character, and writes the output in another file. Normally I would have done two ways:
re.sub
to replace every non letter character with empty stringsstring.lowercase
But this time the text is The Divine Comedy in Italian (I'm Italian), so there are some Unicode characters like
èéï
and some others. I wrote # -*- coding: utf-8 -*-
as the first line of the script, but what I got is that Python doesn't signal errors when Unicode chars are written inside the script.
Then I tried to include Unicode chars in my regular expression, writing them as, for example:
u'\u00AB'
and it seems to work, but Python, when reading input from a file, doesn't rewrite what it read the same way it read it. For example, some characters get converted into square root symbol.
What should I do?
Upvotes: 1
Views: 297
Reputation: 120486
unicodedata.category(unichr)
will return the category of that code-point.
You can find a description of the categories at unicode.org but the ones relevant to you are the L, N, P, Z and maybe S groups:
Lu Uppercase_Letter an uppercase letter Ll Lowercase_Letter a lowercase letter Lt Titlecase_Letter a digraphic character, with first part uppercase Lm Modifier_Letter a modifier letter Lo Other_Letter other letters, including syllables and ideographs ...
You might also want to normalize your string first so that diacriticals that can attach to letters do so:
unicodedata.normalize(form, unistr)
Return the normal form form for the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.
Putting all this together:
file_bytes = ... # However you read your input
file_text = file_bytes.decode('UTF-8')
normalized_text = unicodedata.normalize('NFC', file_text)
allowed_categories = set([
'Ll', 'Lu', 'Lt', 'Lm', 'Lo', # Letters
'Nd', 'Nl', # Digits
'Po', 'Ps', 'Pe', 'Pi', 'Pf', # Punctuation
'Zs' # Breaking spaces
])
filtered_text = ''.join(
[ch for ch in normalized_text
if unicodedata.category(ch) in allowed_categories])
filtered_bytes = filtered_text.encode('UTF-8') # ready to be written to a file
Upvotes: 2
Reputation: 361
import codecs
f = codecs.open('FILENAME', encoding='utf-8')
for line in f:
print repr(line)
print line
1. Will Give you Unicode Formation
2. Will Give you as per written in your file.
Hopefully It will Help you :)
Upvotes: 0