Alex V

Reputation: 3644

Error parsing emails using Python's email module when the encoding is in shift_jis

I am getting an error that says "UnicodeDecodeError: 'shift_jis' codec can't decode bytes in position 2-3: illegal multibyte sequence" when I try to use my email parser to decode a shift_jis encoded email and convert it to unicode. The code and email can be found below:

import email.header
import base64
import sys
import email

def getrawemail():
    line = ' '
    raw_email = ''
    while line:
        line = sys.stdin.readline()
        raw_email += line
    return raw_email

def getheader(subject, charsets):
    for i in charsets:
        if isinstance(i, str):
            encoding = i
            break
    if subject[-2] == "?=":
        encoded = subject[5 + len(encoding):len(subject) - 2]
    else:
        encoded = subject[5 + len(encoding):]
    return (encoding, encoded)

def decodeheader((encoding, encoded)):
    decoded = base64.b64decode(encoded)
    decoded = unicode(decoded, encoding)
    return decoded

raw_email = getrawemail()
msg = email.message_from_string(raw_email)
subject = decodeheader(getheader(msg["Subject"], msg.get_charsets()))
print subject

Email: http://pastebin.com/L4jAkm5R

I have read on another Stack Overflow question that this may be related to a difference between how Unicode and shift_jis are encoded (they referenced this Microsoft Knowledge Base article). If anyone knows what in my code could be causing it to not work, or if this is even reasonably fixable, I would very much appreciate finding out how.

Upvotes: 1

Views: 1412

Answers (1)

unutbu

Reputation: 879749

Starting with this string:

In [124]: msg['Subject']
Out[124]: '=?ISO-2022-JP?B?GyRCNS5KfSRLJEgkRiRiQmdAWiRKJCpDTiRpJDskLCQiJGo'

The =?ISO-2022-JP?B? prefix means the text was encoded as ISO-2022-JP and then base64-encoded (that is what the B stands for).
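For illustration, the fields of such an encoded-word can be pulled apart with a small regular expression. This is only a sketch, written for Python 3: split_encoded_word is a hypothetical helper (it is not part of the email module), and it deliberately tolerates a missing closing ?=, since the subject shown above does not have one.

```python
import re

# Hypothetical helper, not part of the email module.
# Layout assumed: =?charset?encoding?payload?=  (closing ?= optional here)
_ENCODED_WORD = re.compile(r'=\?([^?]+)\?([bBqQ])\?([^?]*)(?:\?=)?$')

def split_encoded_word(word):
    """Return (charset, transfer_encoding, payload) for one encoded-word."""
    match = _ENCODED_WORD.match(word)
    if match is None:
        raise ValueError('not an encoded-word: %r' % word)
    charset, transfer_encoding, payload = match.groups()
    return charset, transfer_encoding.upper(), payload

subject = '=?ISO-2022-JP?B?GyRCNS5KfSRLJEgkRiRiQmdAWiRKJCpDTiRpJDskLCQiJGo'
print(split_encoded_word(subject))
```

The point of the sketch is that the charset is read from the header itself rather than supplied from elsewhere.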

In [125]: msg['Subject'].lstrip('=?ISO-2022-JP?B?')
Out[125]: 'GyRCNS5KfSRLJEgkRiRiQmdAWiRKJCpDTiRpJDskLCQiJGo'
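One caveat about the call above: str.lstrip treats its argument as a set of characters to remove, not as a literal prefix. It gives the right result here only because the payload starts with G, which is not in that set. The sketch below (Python 3) shows the difference using a made-up payload, the base64 of b'Hello', which happens to start with S.

```python
prefix = '=?ISO-2022-JP?B?'
# Hypothetical example word: the payload is base64 for b'Hello'.
word = prefix + 'SGVsbG8='

# lstrip removes any leading characters found in the set of characters
# of `prefix`, so the leading 'S' of the payload is eaten as well.
stripped_by_set = word.lstrip(prefix)

# Slicing off the prefix by length leaves the payload untouched.
stripped_by_prefix = word[len(prefix):] if word.startswith(prefix) else word

print(stripped_by_set)      # GVsbG8=
print(stripped_by_prefix)   # SGVsbG8=
```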

Unfortunately, trying to reverse that process results in an error:

In [126]: base64.b64decode(msg['Subject'].lstrip('=?ISO-2022-JP?B?'))
TypeError: Incorrect padding

Reading this SO answer led me to try adding '?=' to the end of the string:

In [130]: print(base64.b64decode(msg['Subject'].lstrip('=?ISO-2022-JP?B?')+'?=').decode('ISO-2022-JP'))
貴方にとても大切なお知らせがあり
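As far as I can tell, appending '?=' works because b64decode, in its default non-validating mode, discards the '?' (it is outside the base64 alphabet) and the '=' supplies the one missing padding character. A more explicit way to get the same effect is to pad the payload to a multiple of four characters. The following is a sketch for Python 3; b64decode_padded is a hypothetical helper name.

```python
import base64

def b64decode_padded(payload):
    """Decode base64 text after restoring any missing '=' padding."""
    # Base64 text must be a multiple of 4 characters long.
    payload += '=' * (-len(payload) % 4)
    return base64.b64decode(payload)

payload = 'GyRCNS5KfSRLJEgkRiRiQmdAWiRKJCpDTiRpJDskLCQiJGo'
text = b64decode_padded(payload).decode('iso-2022-jp')
print(text)
```

Here the payload is 47 characters long, so exactly one '=' is added before decoding.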

According to Google Translate, this may be translated as "You know there is a very important", which stops mid-sentence.

So it appears the subject line has been truncated.
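As for the original error, a likely cause (an assumption on my part, based on the code in the question) is that the charset is taken from msg.get_charsets(), which reports the charset parameter of each message part's Content-Type, while the Subject header declares its own charset, ISO-2022-JP. Letting email.header.decode_header read the charset from the header avoids that mismatch. The sketch below is for Python 3; decode_subject is a hypothetical helper, and appending '?=' is only a workaround for this particular truncated header.

```python
from email.header import decode_header

def decode_subject(raw_subject):
    """Decode a Subject header, using the charset named in the header."""
    # Workaround (assumption): the header is a single encoded-word whose
    # closing '?=' was lost, so restore it before parsing.
    if raw_subject.startswith('=?') and not raw_subject.endswith('?='):
        raw_subject += '?='
    parts = []
    for chunk, charset in decode_header(raw_subject):
        if isinstance(chunk, bytes):
            # Unencoded fragments come back with charset None.
            parts.append(chunk.decode(charset or 'ascii', errors='replace'))
        else:
            parts.append(chunk)
    return ''.join(parts)

raw = '=?ISO-2022-JP?B?GyRCNS5KfSRLJEgkRiRiQmdAWiRKJCpDTiRpJDskLCQiJGo'
print(decode_subject(raw))
```

Because the header was cut off, the decoded text is still incomplete; this only recovers what is present.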

Upvotes: 1
