Reputation: 23
I want to import emails from an mbox file into a Django app. All database tables are Unicode. My problem: sometimes the wrong charset is declared, and sometimes none at all. What is the best way to deal with these encoding issues?
So far, I just nest exception handlers to try the two charsets I most commonly receive mail in (utf-8 and iso-8859-1):
if not message.is_multipart():
    message_charset = message.get_content_charset()
    msg.message = message_charset + unicode(message.get_payload(decode=False), message_charset)
else:
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            message_charset = part.get_content_charset()
            try:
                msg.message = message_charset + unicode(part.get_payload(decode=False), message_charset)
            except UnicodeDecodeError:
                try:
                    msg.message = message_charset + unicode(part.get_payload(decode=False), "utf-8")
                except UnicodeDecodeError:
                    msg.message = message_charset + unicode(part.get_payload(decode=False), "iso-8859-1")
Is there a better, more robust way?
Thanks!
Upvotes: 2
Views: 476
Reputation: 82934
I'm sorry but your strategy is WRONG.
Firstly, there are encodings that were deliberately designed to fly under the 7-bit ASCII radar so that they could be used in early email systems. The Chinese HZ encoding is little used these days, but Japanese email seems to use ISO-2022-JP quite frequently. Both of those would be wrongly interpreted as ASCII if you tried that first; your current strategy would wrongly interpret them as UTF-8. It would also interpret restricted (all chars < U+0080) UTF-16 text as UTF-8.
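To make the first point concrete, here is a minimal sketch in Python 3 (the question's code is Python 2; the sample strings are only illustrations). It shows that ISO-2022-JP data consists purely of 7-bit bytes and therefore passes an ASCII or UTF-8 decode without any error, and that BOM-less UTF-16 text restricted to characters below U+0080 does the same:

```python
# Japanese text encoded as ISO-2022-JP: escape sequences plus 7-bit bytes only.
text = "こんにちは"
raw = text.encode("iso2022_jp")

# Every byte is below 0x80, so the data is indistinguishable from ASCII by range.
all_seven_bit = all(b < 0x80 for b in raw)

# Neither decode raises, but neither recovers the original text.
as_ascii = raw.decode("ascii")
as_utf8 = raw.decode("utf-8")
wrongly_accepted = (as_utf8 != text)

# Only the right codec gives the text back.
recovered = raw.decode("iso2022_jp")

# UTF-16 (little-endian, no BOM) with all characters below U+0080 is also
# valid UTF-8: the result simply has NUL characters interleaved.
utf16_as_utf8 = "abc".encode("utf-16-le").decode("utf-8")

print(all_seven_bit, wrongly_accepted, recovered == text, repr(utf16_as_utf8))
```

In both cases the "try UTF-8 first" test succeeds, so the fallback chain never gets a chance to try anything else.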
Secondly, ISO-8859-1 maps each of the 256 possible bytes to a Unicode character. random_garbage.decode('iso-8859-1') will never raise an exception. In other words, anything that fails the UTF-8 test will be interpreted as 'ISO-8859-1' by your strategy.
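A small check of this, again sketched in Python 3 with made-up variable names: random bytes always decode as ISO-8859-1, one character per byte, so a successful decode says nothing about whether the charset was right.

```python
import os

# Arbitrary binary junk, standing in for a payload in some unknown encoding.
garbage = os.urandom(1024)

# This cannot raise UnicodeDecodeError: every byte value has a mapping.
decoded = garbage.decode("iso-8859-1")
same_length = (len(decoded) == len(garbage))

# All 256 byte values survive a decode/encode round trip unchanged.
all_bytes = bytes(range(256))
roundtrip = all_bytes.decode("iso-8859-1").encode("iso-8859-1")

print(same_length, roundtrip == all_bytes)
```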
Do what the man said: use chardet right from the start. It knows in what order the tests should be done.
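One possible shape for this, as a rough sketch rather than a definitive implementation (Python 3; `decode_payload` and its `detect` parameter are hypothetical names introduced here, and the fallback order is an assumption): ask chardet for a guess first, fall back to the charset declared in the message, and only then decode with replacement characters so the import never crashes.

```python
def decode_payload(raw, declared_charset=None, detect=None):
    """Return (text, charset_used) for a bytes payload.

    `detect` is a hypothetical hook taking bytes and returning a dict with
    an "encoding" key; it defaults to chardet.detect.
    """
    if detect is None:
        import chardet  # third-party package, assumed to be installed
        detect = chardet.detect

    # chardet.detect returns e.g. {"encoding": "ISO-2022-JP", "confidence": 0.99};
    # "encoding" can be None when it has no idea.
    guess = detect(raw).get("encoding")

    # Try the detector's guess first, then the charset the message declared.
    for charset in (guess, declared_charset):
        if not charset:
            continue
        try:
            return raw.decode(charset), charset
        except (UnicodeDecodeError, LookupError):
            # Wrong guess, or a charset name Python does not know.
            continue

    # Last resort (an assumption of this sketch): never fail, mark bad bytes.
    return raw.decode("utf-8", errors="replace"), "utf-8"

# Possible use with the email package, where `part` is a message part:
#   text, used = decode_payload(part.get_payload(decode=True),
#                               part.get_content_charset())
```

Note that `get_payload(decode=True)` undoes the transfer encoding (base64, quoted-printable) and yields bytes, which is what the detector needs to see; with `decode=False` it would be looking at the still-encoded text.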
Upvotes: 0
Reputation: 281515
You could ask the excellent chardet library to guess the encoding.
"Character encoding auto-detection in Python 2 and 3. As smart as your browser. Open source."
Upvotes: 1