uphill

Reputation: 409

Parsing Email Headers Tabs

I am parsing E-Mails with the Python email module.

When I parse a message with the Python email parser, it does not remove the tab in front of the header values:

from email.parser import Parser
from email.policy import default

testmail = """Date: Wed, 26 Jan 2022 10:45:29 +0100
Message-ID:
\t<123123123123123123123123123123123123123.testinst.themultiverse.com>
Subject:
\t=?iso-8859-1?Q?Auftragsbest=E4tigung_blablabla?=
 =?iso-8859-1?Q?_one nice thing?=

Content Body Whatnot"""


message = Parser(policy=default).parsestr(testmail)

print(repr(message["Message-Id"]))
print(repr(message["Subject"]))

results in:

'\t<123123123123123123123123123123123123123.testinst.themultiverse.com>'
'\tAuftragsbestätigung blablabla one nice thing'

I have tried the different policies of the email parser, but none of them removes the leading tab. I saw that the header_source_parse method of the EmailPolicy class does strip leading whitespace, but only from the remainder of the first line (the part directly after the colon), not from the continuation lines.

<pythonlib>/email/policy.py:

[...]
        value = value.lstrip(' \t') + ''.join(sourcelines[1:])
[...]

Not sure if that is intended behavior or a bug.

My question now: Is there a way in the standard library to do this, or do I need to write a custom policy? The emails come unmodified from an IMAP server (Exchange), and it feels strange that the standard tools do not cover this.

Upvotes: 1

Views: 183

Answers (1)

Serge Ballesta

Reputation: 148880

Something makes me think that the message is not strictly conformant to RFC 5322.

Section 3.2.2, Folding White Space and Comments, says:

However, where CFWS occurs in this specification, it MUST NOT be inserted in such a way that any line of a folded header field is made up entirely of WSP characters and nothing else.

But for the Subject and Message-ID fields, the first line contains only white space after the colon, before the first newline. If I understand correctly, this corresponds to the obsolete syntax, because Section 4, Obsolete Syntax, says:

Another key difference between the obsolete and the current syntax is that the rule in section 3.2.2 regarding lines composed entirely of white space in comments and folding white space does not apply.

The documentation for EmailPolicy in the Python standard library is even more explicit about what happens:

header_source_parse(sourcelines)

The name is parsed as everything up to the ‘:’ and returned unmodified. The value is determined by stripping leading whitespace off the remainder of the first line, joining all subsequent lines together, and stripping any trailing carriage return or linefeed characters.

As the tab occurs on the second line, it is not stripped.
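This can be checked by calling header_source_parse directly on the default policy. The sketch below is only an illustration: the source lines and the example.com message ID are made up, and the list is built by hand to mimic what the parser would pass in for one folded header.

```python
from email.policy import default

# Hypothetical source lines for one folded header: the value starts on the
# continuation line, which begins with a tab.
folded = ["Message-ID:\n", "\t<id@example.com>\n"]
name, value = default.header_source_parse(folded)
print(repr(value))        # '\n\t<id@example.com>'

# For comparison: whitespace on the first line, right after the colon, is stripped.
plain_name, plain_value = default.header_source_parse(["Subject: \thello\n"])
print(repr(plain_value))  # 'hello'
```

In the folded case the remainder of the first line is just the line break, so lstrip(' \t') has nothing to remove there and the tab of the second line is kept.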

I am unsure whether this interpretation is correct, but a possible workaround is to subclass EmailPolicy and strip the leading whitespace, including the initial line break:

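import email.policy

class ObsoletePolicy(email.policy.EmailPolicy):
    def header_source_parse(self, sourcelines):
        header, value = super().header_source_parse(sourcelines)
        # Also drop the line break and whitespace left over when the value starts on a continuation line
        value = value.lstrip(' \t\r\n')
        return header, value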

If you use:

message = Parser(policy=ObsoletePolicy()).parsestr(testmail)

then print(repr(message['Subject'])) now gives:

'Auftragsbestätigung blablabla one nice thing'
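For a quick side-by-side check, here is a small self-contained sketch. The sample message is hypothetical and deliberately plain ASCII, the tab is written as \t so it survives copy and paste, and the subclass is repeated only so that the block runs on its own.

```python
import email.policy
from email.parser import Parser


class ObsoletePolicy(email.policy.EmailPolicy):
    # Same subclass as above, repeated so that this block is self-contained.
    def header_source_parse(self, sourcelines):
        header, value = super().header_source_parse(sourcelines)
        return header, value.lstrip(' \t\r\n')


# Hypothetical sample: the Subject value starts on a continuation line
# that begins with a tab.
sample = "Subject:\n\tHello world\nX-Test: 1\n\nBody\n"

with_default = Parser(policy=email.policy.default).parsestr(sample)
with_fix = Parser(policy=ObsoletePolicy()).parsestr(sample)

print(repr(str(with_default["Subject"])))  # leading tab is still there
print(repr(str(with_fix["Subject"])))      # leading tab is gone
```

Headers whose value starts on the first line, such as X-Test here, should be unaffected, since the extra lstrip only removes characters at the very start of the value.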

Upvotes: 1
