Reputation: 26366

'utf-8' codec can't encode character '\udcc2': surrogates not allowed

I am using Python 3.6.0b2.

I am parsing a lot of emails. This one particular email is a problem because I cannot print the display name of the email address. Attempting to print the email address display name gives:

UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc2' in position 30: surrogates not allowed

Here is a test case that reproduces the problem:

(venv3.6) mailripper@ip-10-0-0-112:/opt/mailripper$ cat test.py
from email import policy
from email.headerregistry import Address
from email.parser import BytesHeaderParser, BytesParser

email_bytes = b'From: =?utf-8?Q?John_Smith=2C_Prince2=C2=AE=2CPMP=C2=AE=2C_CSM=C2?=\r\n =?utf-8?Q?=AE=2C_ITIL=C2=AE=2C_ISTQB=C2=AE?= <[email protected]>\r\n'
msg = BytesHeaderParser(policy=policy.default).parsebytes(email_bytes)
print(msg['from'])
print(msg['from'].addresses[0].display_name)

Here is the error as generated by the above code:

(venv3.6) mailripper@ip-10-0-0-112:/opt/mailripper$ python test.py
"John Smith, Prince2®,PMP®, CSM� �, ITIL®, ISTQB®" <[email protected]>
Traceback (most recent call last):
  File "test.py", line 8, in <module>
    print(msg['from'].addresses[0].display_name)
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc2' in position 30: surrogates not allowed

And here is the display name as shown in the OSX email client, which appears to be able to parse it OK (this is a screenshot, cropped to be small):

My goal is to be able to process any email without unicode errors, and without writing custom unicode error handling code - is that possible?

Can anyone suggest what I can do to avoid getting Unicode errors when displaying email address display names?

Upvotes: 4

Answers (1)

Jim DeLaHunt

Reputation: 11405

You have a tough problem here. Your immediate example is not tough: it is invalid, according to the rules of RFC 2047. The email.parser module is justified in rejecting it. However, email is full of content which is invalid according to the rules. Email tools often work hard to salvage something even from invalid content. What do you want your tool to do with invalid content?

Here is what is invalid with your example. I've shortened it a little. The relevant part of it reads,

b'From: =?utf-8?Q?John=2C_PMP=C2=AE=2C_CSM=C2?=\r\n =?utf-8?Q?=AE=2C?= <[email protected]>\r\n'

This probably was originally the string: From: John, PMP®, CSM®, <[email protected]>.

This is a Python bytes string, containing a From: header as encoded-words. The spec for this is RFC 2047, MIME Part Three: Message Header Extensions for Non-ASCII Text.

In the example, you see two sequences each of =?utf-8?Q? and ?=. RFC 2047, Section 2, "Syntax of encoded-words" tells us that these mark the beginning and end of two encoded-words, and that they use the UTF-8 character set and the "Q" encoding, which is similar to Quoted-Printable. After the "PMP", there is the sequence =C2=AE. This encodes the 2-octet UTF-8 sequence 0xC2 0xAE, which is the character '®'. Then the sequence =2C encodes the 1-octet UTF-8 (and ASCII) sequence 0x2C, which is the character ','.
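As a small illustration of that decoding, here is a sketch that uses the standard library's email.header.decode_header on one complete encoded-word. The variable names are mine, chosen for the example.

```python
from email.header import decode_header

# One complete encoded-word: charset utf-8, "Q" encoding.
word = "=?utf-8?Q?PMP=C2=AE=2C?="

# For an encoded-word, decode_header yields (bytes, charset) pairs.
(payload, charset), = decode_header(word)

print(payload)                  # the raw octets, including 0xC2 0xAE and 0x2C
print(payload.decode(charset))  # the text those octets represent
```

Because the word holds both octets of '®', the payload decodes cleanly on its own.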

The part between the first ?= and the second =?utf-8?Q? reads, \r\n . This is literal, not encoded according to RFC 2047. It is a continuation of a long header line by inserting a line ending and a leading blank. This is also quite legal.

Now look after the "CSM". Notice there is a sequence =C2, then the first ?= which ends the first encoded-word. After the second =?utf-8?Q? begins the second encoded-word, there is a sequence =AE. This is the same 2-octet UTF-8 sequence 0xC2 0xAE, representing the character '®' again. However, the two octets of the UTF-8 character are split across the adjacent encoded-words.
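The effect of the split can be reproduced with plain bytes, without the email package at all. This sketch assumes the parser falls back to the surrogateescape error handler when an encoded-word's octets are not valid in the declared charset. I have not traced that in the parser, but it would be consistent with the '\udcc2' in your error message.

```python
first = b"CSM\xc2"   # octets from the end of the first encoded-word
second = b"\xae,"    # octets from the start of the second encoded-word

# Joined, the octets are valid UTF-8.
joined = (first + second).decode("utf-8")
print(ascii(joined))

# Alone, the first fragment ends in an incomplete 2-octet sequence.
try:
    first.decode("utf-8")
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)

# surrogateescape maps the stray octet 0xC2 to the lone surrogate U+DCC2 ...
escaped = first.decode("utf-8", "surrogateescape")
print(ascii(escaped))

# ... and strict UTF-8 refuses to encode a lone surrogate.
try:
    escaped.encode("utf-8")
except UnicodeEncodeError as exc:
    print("encode failed:", exc.reason)
```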

This is against the rules of RFC 2047, Section 5, "Use of encoded-words in message headers". It says there:

Each 'encoded-word' MUST represent an integral number of characters.
A multi-octet character may not be split across adjacent 'encoded-word's.
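A rough way to check a raw header against this rule is to decode each encoded-word by itself and see whether its octets form whole characters. The following is a sketch, not a full RFC 2047 parser: the regular expression and the helper name undecodable_words are my own, and it ignores details such as RFC 2231 language tags.

```python
import base64
import binascii
import re

# Matches one encoded-word: =?charset?encoding?encoded-text?=
ENCODED_WORD = re.compile(rb"=\?([^?\s]+)\?([QqBb])\?([^?\s]*)\?=")

def undecodable_words(raw_header):
    """Return the encoded-words whose octets do not decode on their own."""
    bad = []
    for match in ENCODED_WORD.finditer(raw_header):
        charset, encoding, text = match.groups()
        try:
            if encoding.upper() == b"Q":
                # header=True makes "_" decode to a space, as "Q" requires.
                octets = binascii.a2b_qp(text, header=True)
            else:
                octets = base64.b64decode(text)
            octets.decode(charset.decode("ascii"))
        except (ValueError, LookupError):
            # Bad transfer encoding, partial character, or unknown charset.
            bad.append(match.group(0))
    return bad

invalid = b'From: =?utf-8?Q?John=2C_PMP=C2=AE=2C_CSM=C2?=\r\n =?utf-8?Q?=AE=2C?= <[email protected]>\r\n'
valid = b'From: =?utf-8?Q?John=2C_PMP=C2=AE=2C_CSM=C2=AE?=\r\n =?utf-8?Q?=2C?= <[email protected]>\r\n'

print(undecodable_words(invalid))  # both words hold half of a character
print(undecodable_words(valid))    # nothing to report
```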

Either of these two renderings of the input would be valid:

b'From: =?utf-8?Q?John=2C_PMP=C2=AE=2C_CSM=C2=AE?=\r\n =?utf-8?Q?=2C?= <[email protected]>\r\n'
b'From: =?utf-8?Q?John=2C_PMP=C2=AE=2C_CSM?=\r\n =?utf-8?Q?=C2=AE=2C?= <[email protected]>\r\n'

(This is as I read the spec. I didn't run the code to check.)

Now, you ask two questions:

My goal is to be able to process any email without unicode errors, and without writing custom unicode error handling code - is that possible?

My answer is "No". If you want to process any email, you will need to be prepared to handle incorrectly formed email. You will need to write custom error handling code (not just for Unicode issues, but for everything) to cope with the weird stuff which will no doubt wash up.
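The smallest piece of such custom handling is a display-side fallback that replaces anything which cannot be encoded. This is only a sketch with a hypothetical helper name, printable. It hides the damage rather than repairing it, and the sample string is an assumption modelled on the output in your question.

```python
def printable(text):
    """Return text with unencodable characters (lone surrogates) replaced."""
    # errors="replace" substitutes "?" for each character UTF-8 cannot encode.
    return text.encode("utf-8", "replace").decode("utf-8")

# Assumed shape of the damaged display name: two lone surrogates.
name = "CSM\udcc2 \udcae, ITIL"

print(printable(name))
```

This stops the UnicodeEncodeError at print time, but the '®' is lost, which is why repairing the header before parsing is the better option when it is feasible.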

Can anyone suggest what I can do to avoid getting Unicode errors when displaying email address display names?

For this example, I can see three approaches:

1. Take a look at the class email.policy.EmailPolicy(**kw) and see if you can figure out how to extend it to handle incorrectly encoded content of this sort. You are passing a relative of this class as policy in BytesHeaderParser(policy=policy.default).parsebytes(email_bytes).
2. Pre-process all header lines, looking at the bytes at the end and beginning of consecutive encoded-words for this problem. Fix it with your own code, then feed the corrected header to BytesHeaderParser(). Maybe you could write a regular expression which could detect the problem.
3. Wrap your call to BytesHeaderParser() in an exception handler, which will try the fixes in #2 only for lines which fail. Having fixed the line, you can try BytesHeaderParser() again.
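The second and third approaches might be sketched as follows. The helper names (merge_adjacent_words, has_surrogates, parse_headers) are hypothetical, and the pattern handles only adjacent utf-8 "Q" encoded-words. Joining words can produce an encoded-word longer than the 75 characters RFC 2047 allows, and I have not verified how the parser treats that, so take this as a starting point rather than a tested fix.

```python
import re
from email import policy
from email.parser import BytesHeaderParser

# Two adjacent utf-8 "Q" encoded-words separated only by folding whitespace.
ADJACENT = re.compile(
    rb"(=\?utf-8\?Q\?[^?\s]*)\?=(?:\r\n|\n)?[ \t]+=\?utf-8\?Q\?",
    re.IGNORECASE,
)

def merge_adjacent_words(raw):
    """Join adjacent utf-8 Q encoded-words so no character straddles two."""
    while True:
        merged = ADJACENT.sub(rb"\1", raw)
        if merged == raw:   # repeat until runs of three or more are joined
            return merged
        raw = merged

def has_surrogates(text):
    """True if text cannot be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False

def parse_headers(raw):
    """Parse headers, retrying with merged encoded-words if any look damaged."""
    parser = BytesHeaderParser(policy=policy.default)
    msg = parser.parsebytes(raw)
    if any(has_surrogates(str(value)) for value in msg.values()):
        msg = parser.parsebytes(merge_adjacent_words(raw))
    return msg

raw = b'From: =?utf-8?Q?John=2C_PMP=C2=AE=2C_CSM=C2?=\r\n =?utf-8?Q?=AE=2C?= <[email protected]>\r\n'
print(merge_adjacent_words(raw))
```

Note that the print() in your test never raises at parse time, so the retry here is triggered by checking the parsed values for surrogates rather than by catching an exception from the parser.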

There will be other problems too. Consider structuring your code to be able to accommodate more and more fixes for invalid content, as you discover you need them.

Upvotes: 6
