Reputation: 409
I am parsing e-mails with the Python email module.
When I parse a message with the Python e-mail parser, it does not remove the tab in front of folded header values:
from email.parser import Parser
from email.policy import default
testmail = """Date: Wed, 26 Jan 2022 10:45:29 +0100
Message-ID:
\t<123123123123123123123123123123123123123.testinst.themultiverse.com>
Subject:
\t=?iso-8859-1?Q?Auftragsbest=E4tigung_blablabla?=
\t=?iso-8859-1?Q?_one nice thing?=

Content Body Whatnot"""
message = Parser(policy=default).parsestr(testmail)
print(repr(message["Message-Id"]))
print(repr(message["Subject"]))
results in:
'\t<123123123123123123123123123123123123123.testinst.themultiverse.com>'
'\tAuftragsbestätigung blablabla one nice thing'
I have tried the different policies of the email parser, but I have not managed to remove the tab at the beginning. I saw that the header_source_parse method of the EmailPolicy class does strip whitespace, but only in combination with a space at the beginning.
<pythonlib>/email/policy.py:
[...]
value = value.lstrip(' \t') + ''.join(sourcelines[1:])
[...]
Not sure if that is intended behavior or a bug.
My question now: is there a way in the standard library to do this, or do I need to write a custom policy? The e-mails come unchanged from an IMAP server (Exchange), and it feels strange that the standard tools do not cover this.
Upvotes: 1
Views: 183
Reputation: 148880
Something makes me think that the message is not strictly conformant to RFC 5322.
Section 3.2.2, Folding White Space and Comments, says:
However, where CFWS occurs in this specification, it MUST NOT be inserted in such a way that any line of a folded header field is made up entirely of WSP characters and nothing else.
But for the Subject and Message-ID fields, the first line contains nothing but whitespace between the colon and the first newline.
IIUC, this corresponds to the obsolete syntax, because section 4, Obsolete Syntax, says:
Another key difference between the obsolete and the current syntax is that the rule in section 3.2.2 regarding lines composed entirely of white space in comments and folding white space does not apply.
The doc for EmailPolicy from Python Standard Library is even more explicit on what happens:
header_source_parse(sourcelines)
The name is parsed as everything up to the ‘:’ and returned unmodified. The value is determined by stripping leading whitespace off the remainder of the first line, joining all subsequent lines together, and stripping any trailing carriage return or linefeed characters.
As the tab occurs on the second line, it is not stripped.
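To see this in isolation, here is a minimal sketch that calls header_source_parse directly on the default policy. The two sourcelines lists are my own hand-built approximation of what the parser passes in (one entry per physical line of a single header field), so treat them as an assumption rather than captured parser input.

```python
from email.policy import default

# One list entry per physical line of a single header field (assumed shape).
simple = ["Subject: hello\n"]
folded = ["Subject:\n", "\t=?iso-8859-1?Q?Auftragsbest=E4tigung_blablabla?=\n"]

simple_name, simple_value = default.header_source_parse(simple)
folded_name, folded_value = default.header_source_parse(folded)

# The value starts on the first line: the space after the colon is stripped.
print(repr(simple_value))
# The value starts on the second line: its leading whitespace is kept.
print(repr(folded_value))
```

Only the remainder of the first line goes through lstrip, so whitespace that arrives with the continuation line is still part of the returned value.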
I am unsure whether this interpretation is correct, but a possible workaround is to write a subclass of EmailPolicy that strips the leading whitespace:
import email.policy

class ObsoletePolicy(email.policy.EmailPolicy):
    def header_source_parse(self, sourcelines):
        header, value = super().header_source_parse(sourcelines)
        # Also strip the line break and tab left over from the empty first line.
        value = value.lstrip(' \t\r\n')
        return header, value
If you use:
message = Parser(policy=ObsoletePolicy()).parsestr(testmail)
you will now get for print(repr(message['Subject'])):
'Auftragsbestätigung blablabla one nice thing'
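If a custom policy feels like too much, another option might be to keep the default policy and strip the value when reading it. This is only a sketch: clean_header is a hypothetical helper name of mine, not something the email package provides, and the shortened test message is an assumption modelled on the one in the question.

```python
from email.parser import Parser
from email.policy import default


def clean_header(message, name):
    """Return the header as a plain str without surrounding whitespace, or None."""
    value = message[name]
    return None if value is None else str(value).strip()


# Shortened version of the question's message; the folded line starts with a tab.
raw = (
    "Date: Wed, 26 Jan 2022 10:45:29 +0100\n"
    "Subject:\n"
    "\t=?iso-8859-1?Q?Auftragsbest=E4tigung_blablabla?=\n"
    "\n"
    "Content Body Whatnot"
)

message = Parser(policy=default).parsestr(raw)
subject = clean_header(message, "Subject")
print(repr(subject))
```

The difference to the policy subclass is that the parsed message itself is left untouched; only the values you read through the helper are cleaned.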
Upvotes: 1