Python lxml & string encoding issue

Question

I'm using lxml to extract text from html docs and I cannot get some characters from the text to render properly. It's probably a stupid thing, but I can't seem to figure out a solution...

Here's a simplified version of the html:


    
        
    
    
        DAÑA – bis'e



A simplified version of the code:

import lxml.html as LH
htmlfile = "path/to/file"
tree = LH.parse(htmlfile)
root = tree.getroot()
for para in root.iter("p"):
    print(para.text)


The output in my terminal has those little boxes with a character error (for example, a box appears where "– E" should be), but if I copy-paste from there to here, it looks like: 

>>> DAÃO   bisâe

If I do a simple echo + problem characters in the terminal they render properly, so I don't think that's the problem.

The html encoding is UTF-8 (checked with docinfo). I've tried .encode() and .decode() in various places in the code. I also tried the lxml.etree.tostring() with utf-8 declaration (but then .iter() doesn't work ('bytes' object has no attribute 'iter'), or if I put it towards the endnodes in the code, the .text doesn't work ('bytes' object has no attribute 'text')).

Any ideas what's going wrong and/or how to solve?

Tomalak · Accepted Answer

Open the file with the correct encoding (I'm assuming UTF-8 here; look at the HTML file to confirm).

import lxml.html as LH

with open("path/to/file", encoding="utf8") as f:
    tree = LH.parse(f)
    root = tree.getroot()
    for para in root.iter("p"):
        print(para.text)

Here is some background on how you arrived where you currently are.

Incoming data from the server:

Bytes (hex)            Decoded as   Result String          Comment
44 41 C3 91 4F         UTF-8        DAÑO                   proper decode
44 41 C3 91 4F         Latin-1      DAÃ▯O                  improper decode

The bytes should not have been decoded as Latin-1, that's an error.

C3 91 represents one character in UTF-8 (the Ñ) but it's two characters in Latin-1 (the Ã, and byte 91). But byte 91 has no printable character in Latin-1 (it falls in the C1 control range), so there is nothing to display. I used ▯ to make it visible. A text editor might skip it altogether, showing DAÃO instead, or a weird box, or an error marker.
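To see the difference for yourself, here is a minimal sketch that decodes the same five bytes both ways. The variable names are only illustrative.

```python
# The five bytes from the table above.
raw = bytes.fromhex("44 41 C3 91 4F")

good = raw.decode("utf-8")    # C3 91 is read as one character, Ñ
bad = raw.decode("latin-1")   # C3 and 91 are read as two separate characters

print(good)                   # DAÑO
print(repr(bad))              # 'DAÃ\x91O' (the \x91 is the invisible control character)
print(len(good), len(bad))    # 4 5
```

The Latin-1 decode never raises an error, because every byte value maps to some code point there. That is why this kind of mistake tends to go unnoticed.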

When writing the improperly decoded string to file:

String                 Encoded as   Result Bytes (hex)     Comment
DAÃ▯O                  UTF-8        44 41 C3 83 C2 91 4F   weird box preserved as C2 91

The string should not have been encoded as UTF-8 at this point; that's an error, too.

The Ã got converted to C3 83, which is correct for this character in UTF-8. Note how the byte sequence now matches what you told me in the comments (\xc3\x83\xc2\x91).
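The second mistake can be reproduced the same way. This sketch assumes the string came from the wrong Latin-1 decode described earlier, and then encodes it as UTF-8:

```python
# Start from the improperly decoded string (UTF-8 bytes read as Latin-1).
bad = bytes.fromhex("44 41 C3 91 4F").decode("latin-1")

# Encoding that string as UTF-8 turns each of the two wrong characters
# into its own two-byte sequence.
mangled = bad.encode("utf-8")

print(mangled)                     # b'DA\xc3\x83\xc2\x91O'
print(mangled.hex(" ").upper())    # 44 41 C3 83 C2 91 4F
```

The `bytes.hex()` separator argument needs Python 3.8 or later.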

When reading that file:

Bytes (hex)            Decoded as   Result String          Comment
44 41 C3 83 C2 91 4F   UTF-8        DAÃ▯O                  unprintable character is retained
44 41 C3 83 C2 91 4F   Latin-1      DAÃƒÂ▯O                unprintable character is retained

No matter how you decode that, it remains broken.
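A single decode cannot fix it, but if the damage is exactly the one round trip described here (UTF-8 bytes read as Latin-1, then written back as UTF-8), reversing those two steps may recover the text. This is only a sketch under that assumption; it will fail or produce garbage if the file was damaged some other way.

```python
# Bytes as they ended up in the mangled file.
mangled = bytes.fromhex("44 41 C3 83 C2 91 4F")

# Step 1: undo the UTF-8 encode. This gives back the wrongly decoded string.
text = mangled.decode("utf-8")

# Step 2: undo the wrong Latin-1 decode, then decode properly as UTF-8.
repaired = text.encode("latin-1").decode("utf-8")

print(repaired)    # DAÑO
```

Treat this as damage control for files you cannot download again. Fixing the place where the data gets saved is still the better solution.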

Your data got mangled by making two mistakes in a row: decoding it improperly, and then re-encoding it improperly. The right thing would have been to write the bytes from the server directly to disk, without converting them to string at any point.
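Writing the bytes straight to disk could look roughly like this. `save_raw` is a hypothetical helper, and `payload` is a stand-in for whatever your download code returns as bytes (for example `response.content` with requests, or `response.read()` with urllib). I don't know how you fetch the pages, so that part is an assumption.

```python
import tempfile
from pathlib import Path


def save_raw(payload: bytes, path: Path) -> None:
    """Write the bytes exactly as received; no decode or encode step."""
    path.write_bytes(payload)


# Stand-in for the body of a server response (assumed to be UTF-8 here).
payload = "<p>DAÑO</p>".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "page.html"
    save_raw(payload, target)

    # The file holds the same bytes, so a later UTF-8 read works as expected.
    round_tripped = target.read_bytes()
    decoded = target.read_text(encoding="utf-8")

print(round_tripped == payload)    # True
print(decoded)                     # <p>DAÑO</p>
```

Because nothing converts the data to a string on the way to disk, there is no point at which a wrong codec can sneak in.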
