Python email.header.decode_header fails for multiline headers

I'm building a system that reads emails from a Gmail account and fetches the subjects, using Python's imaplib and email modules. Sometimes, emails received from a Hotmail account have line breaks in their headers, for instance:

In [4]: message['From']
Out[4]: '=?utf-8?B?aXNhYmVsIG1hcsOtYSB0b2Npbm8gZ2FyY8OtYQ==?=\r\n\t<[email protected]>'

If I try to decode that header, nothing is decoded and the raw value comes back unchanged:

In [5]: email.header.decode_header(message['From'])
Out[5]: [('=?utf-8?B?aXNhYmVsIG1hcsOtYSB0b2Npbm8gZ2FyY8OtYQ==?=\r\n\t<[email protected]>', None)]

However, if I replace the line break and tab with a space, it works:

In [6]: email.header.decode_header(message['From'].replace('\r\n\t', ' '))
Out[6]: [('isabel mar\xc3\xada tocino garc\xc3\xada', 'utf-8'), ('<[email protected]>', None)]

Is this a bug in decode_header? If not, I would like to know what other special cases like this I should be aware of.

Upvotes: 7

Views: 1291

Answers (2)

Benjy Malca

Reputation: 637

This bug still occurs in some Python 2.7 versions, so the following workaround can be used:

>>> email.header.decode_header('=?utf-8?B?aXNhYmVsIG1hcsOtYSB0b2Npbm8gZ2FyY8OtYQ==?=\r\n\t<[email protected]>'.replace('\r\n\t', ' '))
[('isabel mar\xc3\xada tocino garc\xc3\xada', 'utf-8'), ('<[email protected]>', None)]

It replaces the CRLF and the tab that follows it with a single space. With this, decode_header parses the header correctly.
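The same idea can be generalized a little, since a folded header may continue with a space instead of a tab, or use a bare LF. The following is only a sketch: the helper names unfold_header and decode_header_text, the 'ascii' fallback charset, and the address user@example.com are assumptions made for illustration, not part of the email package or the original message.

```python
import re
from email.header import decode_header

# A folded header line is a line break followed by at least one space or tab.
_FOLD_RE = re.compile(r'\r?\n[ \t]+')


def unfold_header(value):
    """Replace each fold (line break plus leading whitespace) with one space."""
    return _FOLD_RE.sub(' ', value)


def decode_header_text(value, fallback='ascii'):
    """Unfold a raw header value, decode it, and return a single text string.

    `fallback` is a hypothetical default used for chunks with no charset.
    """
    pieces = []
    for chunk, charset in decode_header(unfold_header(value)):
        # decode_header may return bytes chunks (or plain str when nothing
        # was encoded), so only decode when we actually hold bytes.
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or fallback, 'replace')
        pieces.append(chunk)
    return ''.join(pieces)


raw = '=?utf-8?B?aXNhYmVsIG1hcsOtYSB0b2Npbm8gZ2FyY8OtYQ==?=\r\n\t<user@example.com>'
print(decode_header_text(raw))
```

Whitespace between the decoded pieces may differ between Python versions, so it is safer to treat the result as display text than to parse the address out of it.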

Upvotes: 2

Robᵩ

Reputation: 168626

It is a bug in decode_header. It is present in Python 2.7 and fixed in Python 3.3.

>>> sys.version_info
sys.version_info(major=3, minor=3, micro=2, releaselevel='final', serial=0)
>>> email.header.decode_header('=?utf-8?B?aXNhYmVsIG1hcsOtYSB0b2Npbm8gZ2FyY8OtYQ==?=\r\n\t<[email protected]>')
[(b'isabel mar\xc3\xada tocino garc\xc3\xada', 'utf-8'), (b'<[email protected]>', None)]

vs

>>> sys.version_info
sys.version_info(major=2, minor=7, micro=5, releaselevel='final', serial=0)
>>> email.header.decode_header('=?utf-8?B?aXNhYmVsIG1hcsOtYSB0b2Npbm8gZ2FyY8OtYQ==?=\r\n\t<[email protected]>')
[('=?utf-8?B?aXNhYmVsIG1hcsOtYSB0b2Npbm8gZ2FyY8OtYQ==?=\r\n\t<[email protected]>', None)]

Upvotes: 5
