Vikash Balasubramanian

Reputation: 3233

Python understanding unicode conversion

I have a text dataset which had some encoding issues. The author instructed to do:

for line in fpointer:
    line.encode('latin-1').decode('utf-8')

To fix the issues.

To see why this was required, I opened the file before applying the fix and saw this line:

103 But in Imax 3-D , the clichÃ©s disappear into the vertiginous perspectives opened up by the photography .

After conversion it became:

103 But in Imax 3-D , the clichés disappear into the vertiginous perspectives opened up by the photography .

It makes sense.

But I could not understand what caused the original issue, or how the fix worked.

I referred to the Python Unicode HOWTO: https://docs.python.org/3/howto/unicode.html

I also checked characters and their values:

The utf-8 encoding for é is c3a9 and the iso-8859-1 encoding for à is c3 and for © it is a9.

It makes some sense, but I am not able to make the connection.

How exactly is the line stored in the original file and how did the code snippet fix it?

Upvotes: 1

Views: 1239

Answers (3)

jsbueno

Reputation: 110666

So, what happened is that the text you had was "double-encoded" as UTF-8.

At some point in the process that generated your data, text that was already UTF-8 encoded (with "é" stored as the bytes "\xc3\xa9") was interpreted as being in Latin-1, where "\xc3\xa9" represents the two characters "Ã©", and was then re-encoded from Latin-1 to UTF-8. Each of those characters was expanded to two bytes, becoming "\xc3\x83" "\xc2\xa9" (the UTF-8 for "Ã©"). As @Novoselov puts it in the other answer, this corruption likely came from opening the file as text on Windows without specifying an encoding: Python then uses the default Windows encoding, "latin-1" (or the closely related cp1252, depending on the locale), and therefore reads each byte of a UTF-8 sequence as a single Latin-1 character.

What the fix did: your system setup is already configured to read text as UTF-8, so when you got the lines in the for loop you got Python 3 strings (Python 2 unicode) with the UTF-8 sequences in the text file correctly decoded. So the 4-byte sequence became 2 text characters. Now, one characteristic of the "latin1" encoding is that it is "transparent": encoding with it is equivalent to performing no transformation at all on the text. In other words, each character whose code point fits in a single byte becomes a single byte with that same value in the encoded byte string. (And a character whose code point does not fit in a byte can't be encoded as Latin-1 at all, which raises a UnicodeEncodeError.)
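To make the "transparent" claim concrete, here is a minimal sketch (the variable names are just for illustration) showing that Latin-1 maps every byte value 0..255 to the code point with the same number, and that anything above U+00FF cannot be encoded:

```python
# Latin-1 maps code points 0..255 one-to-one onto byte values 0..255.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")

# Each byte becomes the character with the same numeric value...
same_values = [ord(ch) for ch in as_text] == list(all_bytes)

# ...and encoding back gives exactly the bytes we started with.
round_trip = as_text.encode("latin-1")

# A character above U+00FF (here the euro sign) has no Latin-1 byte,
# so trying to encode it raises UnicodeEncodeError.
try:
    "\u20ac".encode("latin-1")
    euro_failed = False
except UnicodeEncodeError:
    euro_failed = True

print(same_values, round_trip == all_bytes, euro_failed)
```

This is why encode('latin-1') can be used to get back the raw bytes hiding behind the mis-decoded characters.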

So, after the "transparent" encoding step, you have bytes that represent your text - this time with only "one pass" of utf-8 encoding. And decoding these bytes as "utf-8" yielded you the correct text for the file.

Again:

This was the original text: "cliché". Encoded to UTF-8 it becomes b'clich\xc3\xa9'. But the process that created your file treated this sequence as Latin-1, and so re-encoded both bytes above 0x7F to UTF-8 as if each were a character: b'clich\xc3\x83\xc2\xa9'. And this is what prints as "clichÃ©".

On reading, Python 3 reads b'clich\xc3\x83\xc2\xa9' from the disk and returns "clichÃ©" to you as (Unicode) text. You encode this to bytes with the call to encode('latin-1') and get b'clich\xc3\xa9'. Finally, you decode that as "utf-8", getting the text "cliché".

Python3 does not easily allow one to spoil text like this. To go from the text to the incorrect version you had, one has also to use the "transparent" encoding "latin-1" - this is an example:

In [10]: a = "cliché"

In [11]: b = a.encode("utf-8")

In [12]: b
Out[12]: b'clich\xc3\xa9'

In [13]: c = b.decode("latin1").encode("utf-8")

In [14]: c
Out[14]: b'clich\xc3\x83\xc2\xa9'
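Going the other way, the repair can be followed step by step. This is only a sketch of the same round trip as a plain script; the variable names are mine, and the accented characters are written as escapes so the source file's own encoding does not matter:

```python
original = "clich\u00e9"  # 'cliché'

# Reproduce the corruption: UTF-8 bytes misread as Latin-1, then re-encoded.
double_encoded = original.encode("utf-8").decode("latin-1").encode("utf-8")

# What you get when the corrupted file is read as UTF-8: 'clichÃ©'
as_read = double_encoded.decode("utf-8")

# Latin-1 turns each character back into the byte with the same value.
single_encoded = as_read.encode("latin-1")

# Those bytes are the original UTF-8, so decoding them recovers the text.
repaired = single_encoded.decode("utf-8")

print(double_encoded, as_read, single_encoded, repaired)
```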

Upvotes: 3

Serge Ballesta

Reputation: 149155

From your comment, you say that you are opening a text file in Python 3 without specifying any encoding. In that case, Python uses the system encoding, which on Windows is Latin1 (or the closely related cp1252, depending on the locale).

That is enough to explain what you get if the file was originally UTF-8 encoded. But IMHO the correct way is to specify the file encoding in the open function:

fd = open(filename, encoding='utf8')

That way, you directly get the correct characters, with no need for the encode-decode correction.
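To see the difference without depending on the system default, here is a small sketch that writes a UTF-8 file to a temporary directory and reads it back with two explicit encodings. The file name and variable names are made up for the example:

```python
import os
import tempfile

text = "clich\u00e9"  # 'cliché'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name

    # Store the text on disk as UTF-8 bytes.
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))

    # Reading with the wrong encoding yields one character per byte.
    with open(path, encoding="latin-1") as f:
        wrong = f.read()

    # Reading with the right encoding yields the original text.
    with open(path, encoding="utf-8") as f:
        right = f.read()

print(repr(wrong), repr(right))
```

Note that this only helps when the file on disk holds correct UTF-8; if the bytes were already double-encoded, the encode-decode step is still needed.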

Upvotes: 0

Ilia Novoselov

Reputation: 363

The original text was encoded in utf-8, but some process decoded it as latin1 and then encoded it as utf-8 again.

So, to get the original text back, you have to reverse this process: you decode the text from the file as utf-8 (this is not included in your snippet, but I guess you open the file with utf-8 encoding), then encode it as latin1, then decode it again as utf-8.
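If the file mixes damaged and undamaged lines, applying the reversal blindly can raise an exception. One possible way to handle that, sketched with a hypothetical helper undo_double_utf8 (not part of any library), is to fall back to the unchanged line whenever either step fails:

```python
def undo_double_utf8(text):
    """Hypothetical helper: reverse a latin-1/utf-8 double encoding.

    Returns the text unchanged when it does not look double-encoded,
    that is, when either step of the reversal raises an error.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


lines = [
    "clich\u00c3\u00a9s",  # double-encoded 'clichés'
    "plain ascii",         # unaffected by the round trip
    "price: \u20ac5",      # euro sign cannot be encoded as Latin-1
    "caf\u00e9",           # already correct; b'\xe9' is not valid UTF-8
]
fixed = [undo_double_utf8(line) for line in lines]
print(fixed)
```

This is a heuristic: a line that really is meant to contain a sequence like "Ã©" would be "repaired" too, so it only makes sense on data you know was double-encoded.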

Upvotes: 1
