Reputation: 3644
I am getting an error that says "UnicodeDecodeError: 'shift_jis' codec can't decode bytes in position 2-3: illegal multibyte sequence" when I try to use my email parser to decode a shift_jis encoded email and convert it to unicode. The code and email can be found below:
import email.header
import base64
import sys
import email
def getrawemail():
line = ' '
raw_email = ''
while line:
line = sys.stdin.readline()
raw_email += line
return raw_email
def getheader(subject, charsets):
for i in charsets:
if isinstance(i, str):
encoding = i
break
if subject[-2] == "?=":
encoded = subject[5 + len(encoding):len(subject) - 2]
else:
encoded = subject[5 + len(encoding):]
return (encoding, encoded)
def decodeheader((encoding, encoded)):
decoded = base64.b64decode(encoded)
decoded = unicode(decoded, encoding)
return decoded
raw_email = getrawemail()
msg = email.message_from_string(raw_email)
subject = decodeheader(getheader(msg["Subject"], msg.get_charsets()))
print subject
Email: http://pastebin.com/L4jAkm5R
I have read on another Stack Overflow question that this may be related to a difference between how Unicode and shift_jis are encoded (they referenced this Microsoft Knowledge Base article). If anyone knows what in my code could be causing it to not work, or if this is even reasonably fixable, I would very much appreciate finding out how.
Upvotes: 1
Views: 1412
Reputation: 879749
Starting with this string:
In [124]: msg['Subject']
Out[124]: '=?ISO-2022-JP?B?GyRCNS5KfSRLJEgkRiRiQmdAWiRKJCpDTiRpJDskLCQiJGo'
=?ISO-2022-JP?B?
means the string is ISO-2022-JP encoded, then base64 encoded.
In [125]: msg['Subject'].lstrip('=?ISO-2022-JP?B?')
Out[125]: 'GyRCNS5KfSRLJEgkRiRiQmdAWiRKJCpDTiRpJDskLCQiJGo'
Unfortunately, trying to reverse that process results in an error:
In [126]: base64.b64decode(msg['Subject'].lstrip('=?ISO-2022-JP?B?'))
TypeError: Incorrect padding
Reading this SO answer lead me to try adding '?=' to the end of the string:
In [130]: print(base64.b64decode(msg['Subject'].lstrip('=?ISO-2022-JP?B?')+'?=').decode('ISO-2022-JP'))
貴方にとても大切なお知らせがあり
According to google translate, this may be translated as "You know there is a very important".
So it appears the subject line has been truncated.
Upvotes: 1