Skaperen

Reputation: 463

str.encode() changes UTF-8 characters

It has been suggested that this question is a duplicate of 6269765. I did not use any b'' literals in either the original code or the minimal version. See the embedded edits below:

I have reduced a problem I have been having today down to this minimal Python 3 code:

x='\xc3\xb3'
print(''.join([hex(ord(c))[2:] for c in x]))
print(''.join([hex(c)[2:] for c in x.encode()]))

When I run this code I get:

c3b3
c383c2b3

Is str.encode() really supposed to change the UTF-8 character ó (LATIN SMALL LETTER O WITH ACUTE) into the two characters Ã³ (LATIN CAPITAL LETTER A WITH CIRCUMFLEX and SUPERSCRIPT THREE)?


edit:

No Ã³ was entered, as one commenter suggested. Only ó was entered in some cases; in others it was read from a text file. The text file involved in the original problem was the system dictionary file of the current version of Ubuntu. The file is dated 23 Oct 2011, and its filesystem path appears in the command examples of the original question.

The original problem involved encountering the word Asunción at line 1053 of that file. The ó character in Asunción has the byte sequence C3B3, which the FileFormat.Info UTF-8 lookup table describes as LATIN SMALL LETTER O WITH ACUTE (described here for readers unable to properly read Unicode text).

No b'' literals were used in any code, neither the original, nor the minimal.

The nature of the problem was discovered as UTF-8 characters being changed from ó to Ã³, i.e. the byte sequence c3b3 becoming c383c2b3. The dictionary file literally contains the two bytes c3b3, which display as ó, exactly as expected and as described in that UTF-8 table. The original symptom was an exception being raised due to the change in length.

str.encode() was used in an attempt to reproduce the problem and discover its source. It is believed that something, somewhere, did something similar to str.encode().
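For illustration only (not part of the original post), the suspected mechanism can be reproduced in a short sketch: if the UTF-8 bytes for ó are mistakenly decoded with a one-byte codec such as Latin-1 (one plausible culprit, assumed here) and then re-encoded as UTF-8, the result is exactly c383c2b3:

```python
# Sketch of the suspected corruption: UTF-8 bytes decoded with the
# wrong codec (Latin-1 is an assumption), then re-encoded as UTF-8.
raw = '\u00f3'.encode('utf-8')         # b'\xc3\xb3' -- the bytes in the file
mojibake = raw.decode('latin-1')       # two-character str '\xc3\xb3'
double_encoded = mojibake.encode('utf-8')
print(double_encoded.hex())            # c383c2b3
```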

The minimal code to show this problem at first was:

x='Asunción'
print(' '.join([hex(ord(c))[2:] for c in x]))
print(' '.join([hex(c)[2:] for c in x.encode()]))

but I found that many people were unable to see the lowercase acute o, so I changed it to hexadecimal escape codes (\x), which produced the same hexadecimal verification output, both before and after the str.encode() call, as the first minimal example just above with the literal full word Asunción.

Then I decided it would be more minimal to use the affected character alone, with no spaces in the hexadecimal output.

end of edit, back to original post:


This UTF-8 character was encountered in /usr/share/dict/american-english, the American English dictionary file on the latest Ubuntu release. You can see the first word in that file containing this sequence with the command:

head -1053 /usr/share/dict/american-english|tail -1

You can see it in hexadecimal with the command:

head -1053 /usr/share/dict/american-english|tail -1|od -Ad -tx1

Character descriptions were obtained from here. I am running Python 3.5.2, compiled with GCC 5.4.0, on Ubuntu 16.04.1 LTS, updated 2 days ago.

edit:

Is the correct answer here to avoid bytes entirely and not use str.encode()? Or is there a better answer?

Upvotes: 2

Views: 3482

Answers (2)

Josh Lee

Reputation: 177500

You don’t even have to call encode() to see what’s in your string. Python will just show it to you in interactive mode:

>>> '\xc3\xb3'
'Ã³'

This is a unicode string, of length 2, whose characters are exactly what you see. No bytes or UTF-8 are involved at all, except perhaps at the boundary to send them to your terminal or to read them from your source file. If you want a unicode character in a string, you can directly insert it, or escape it with \x (if FF or less), \u (if FFFF or less), or \U (for all characters).

>>> '\xf3' == '\u00f3' == 'ó'
True

And if you do want a UTF-8 literal for some reason, that would be a bytes literal:

>>> b'\xc3\xb3'
b'\xc3\xb3'

This is a byte string, of length 2. When you ask Python to show it to you, it shows it as written, because Python doesn't know what's in your bytes.

>>> b'\xc3\xb3'.decode()
'ó'

The input is a byte string (of length 2, containing UTF-8 data) and the output is a unicode string (of length 1).
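The round trip described above can be checked explicitly. This is a small illustrative sketch added for clarity, not part of the original answer:

```python
# A 1-character str encodes to 2 UTF-8 bytes; decoding reverses it exactly.
s = '\u00f3'                  # LATIN SMALL LETTER O WITH ACUTE
b = s.encode('utf-8')         # b'\xc3\xb3'
assert len(s) == 1 and len(b) == 2
assert b.decode('utf-8') == s
print(b)                      # b'\xc3\xb3'
```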

Upvotes: 1

John Machin

Reputation: 82924

Small-q question snatched from the very large-Q Question:

When I run this code I get:

c3b3

c383c2b3

Is str.encode() really supposed to change the UTF-8 character ó (LATIN SMALL LETTER O WITH ACUTE) into the two characters Ã³ (LATIN CAPITAL LETTER A WITH CIRCUMFLEX and SUPERSCRIPT THREE)?

There is no such thing as an "UTF-8 character". LATIN SMALL LETTER O WITH ACUTE is a Unicode character (python: str object). Its Unicode codepoint is U+00F3.

>>> import unicodedata as ucd
>>> smalloacute = u"\u00f3"
>>> ucd.name(smalloacute)
'LATIN SMALL LETTER O WITH ACUTE'

Now you can encode that to a bytes object:

>>> smalloacute.encode('utf8')
b'\xc3\xb3'

and write your bytes object to a file or whatever you want to do. Note that b'\xc3' is a bytes object and has no useful relation to LATIN CAPITAL LETTER A WITH CIRCUMFLEX. Likewise b'\xb3' and SUPERSCRIPT THREE.

You certainly don't want to spin the utf8 wheel a second time; the result is not very useful:

>>> smalloacute.encode('utf8').decode('latin1').encode('utf8')
b'\xc3\x83\xc2\xb3'

Seen that before? Note: the decode('latin1') is merely changing the type from bytes to str.

Back to the original question/statement "str.encode() changes UTF-8 characters". Short answer: no it doesn't! From the file you get a 2-byte sequence representing the Unicode o-acute character. You may want to work with that directly. Alternatively you may want to do bytes.decode() and work in str objects. You should very definitely not do bytes.kludge().encode().
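A minimal sketch of that decode-at-the-boundary approach (the temporary file name and contents are made up for illustration; for the real dictionary file you would pass encoding='utf-8' to open() in the same way):

```python
import os
import tempfile

# Write UTF-8 bytes to a file, then read it back as text: decoding
# happens once at the boundary, and the program works in str afterwards.
path = os.path.join(tempfile.mkdtemp(), 'words.txt')
with open(path, 'wb') as f:
    f.write('Asunci\u00f3n\n'.encode('utf-8'))

with open(path, encoding='utf-8') as f:
    word = f.readline().rstrip('\n')

print(len(word))  # 8 characters, although the word occupies 9 bytes on disk
```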

Upvotes: 1
