Bob

Reputation: 41

Python lxml & string encoding issue

I'm using lxml to extract text from html docs and I cannot get some characters from the text to render properly. It's probably a stupid thing, but I can't seem to figure out a solution...

Here's a simplified version of the html:

<html>
    <head>
        <meta content="text/html" charset="UTF-8"/>
    </head>
    <body>
        <p>DAÑA – bis'e</p> <!---that's an N dash and the single quote is curly--->
    </body>
</html>

A simplified version of the code:

import lxml.html as LH
htmlfile = "path/to/file"
tree = LH.parse(htmlfile)
root = tree.getroot()
for para in root.iter("p"):
    print(para.text)

The output in my terminal has those little boxes with a character error (for example, a box where "– E" should be), but if I copy-paste from there to here, it looks like:

>>> DAÃO bisâe

If I do a simple echo + problem characters in the terminal they render properly, so I don't think that's the problem.

The html encoding is UTF-8 (checked with docinfo). I've tried .encode() and .decode() in various places in the code. I also tried the lxml.etree.tostring() with utf-8 declaration (but then .iter() doesn't work ('bytes' object has no attribute 'iter'), or if I put it towards the endnodes in the code, the .text doesn't work ('bytes' object has no attribute 'text')).

Any ideas what's going wrong and/or how to solve?

Upvotes: 0

Views: 2485

Answers (2)

Tomalak

Reputation: 338426

Open the file with the correct encoding (I'm assuming UTF-8 here; look at the HTML file to confirm).

import lxml.html as LH

with open("path/to/file", encoding="utf8") as f:
    tree = LH.parse(f)
    root = tree.getroot()
    for para in root.iter("p"):
        print(para.text)
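
To see why the encoding argument matters, here is a minimal standard-library sketch. The file name sample.html and the sample text are made up for illustration, and lxml is left out on purpose so the snippet only shows the decoding step.

```python
import os
import tempfile

# Hypothetical sample text, similar to the paragraph in the question.
html = "<p>DAÑO – bis’e</p>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.html")  # hypothetical file name

    # Write the text to disk as UTF-8 bytes.
    with open(path, "wb") as f:
        f.write(html.encode("utf-8"))

    # Reading with the matching encoding gives the original text back.
    with open(path, encoding="utf-8") as f:
        as_utf8 = f.read()

    # Reading the same bytes as Latin-1 produces mojibake instead.
    with open(path, encoding="latin-1") as f:
        as_latin1 = f.read()

print(as_utf8 == html)    # True
print(as_latin1 == html)  # False
```

If the bytes on disk are already mangled, as described below, choosing the right encoding when reading will not be enough by itself.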

Here is some background on how you arrived where you currently are.

Incoming data from the server:

Bytes (hex)            Decoded as   Result String          Comment
44 41 C3 91 4F         UTF-8        DAÑO                   proper decode
44 41 C3 91 4F         Latin-1      DAÃ▯O                  improper decode

The bytes should not have been decoded as Latin-1, that's an error.

C3 91 represents one character in UTF-8 (the Ñ) but it's two characters in Latin-1 (the Ã, and byte 91). Byte 91 is a non-printing control character in Latin-1, so there is no visible character to display. I used ▯ to make it visible. A text editor might skip it altogether, showing DAÃO instead, or show a weird box or an error marker.
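
The two decodes in the table can be reproduced in a Python shell. This is only a sketch using the built-in codecs; the variable names are mine.

```python
# The five bytes from the table above.
raw = bytes.fromhex("44 41 C3 91 4F")

as_utf8 = raw.decode("utf-8")      # C3 91 is read as one character: Ñ
as_latin1 = raw.decode("latin-1")  # C3 and 91 are read as two characters

print(as_utf8)                        # DAÑO
print(len(as_utf8), len(as_latin1))   # 4 5
print(ascii(as_latin1))               # 'DA\xc3\x91O'
```

The Latin-1 decode never fails, because every byte value maps to some code point in that encoding. That is why this kind of mistake goes unnoticed until the text is displayed.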

When writing the improperly decoded string to file:

String                 Encoded as   Result Bytes (hex)     Comment
DAÃ▯O                  UTF-8        44 41 C3 83 C2 91 4F   weird box preserved as C2 91

The string should not have been encoded as UTF-8 at this point, that's an error, too.

The à got converted to C3 83, which is correct for this character in UTF-8. Note how the byte sequence now matches what you told me in the comments (\xc3\x83\xc2\x91).

When reading that file:

Bytes (hex)            Decoded as   Result String          Comment
44 41 C3 83 C2 91 4F   UTF-8        DAÃ▯O                  unprintable character is retained
44 41 C3 83 C2 91 4F   Latin-1      DAÃ▯Â▯O                unprintable characters are retained

No matter how you decode that, it remains broken.

Your data got mangled by making two mistakes in a row: decoding it improperly, and then re-encoding it improperly again. The right thing would have been to write the bytes from the server directly to disk, without converting them to string at any point.
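
A rough sketch of both points, using only the standard library. The payload below is a stand-in for bytes received from a server, the file name page.html is made up, and undo_double_encoding is a hypothetical helper. The repair only works under the assumption that exactly the two mistakes described above happened (Latin-1 decode, then UTF-8 encode) and that nothing was lost along the way.

```python
import os
import tempfile

# Stand-in for the raw bytes a server would send (assumed to be UTF-8).
payload = "DAÑO".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page.html")  # hypothetical file name

    # Binary mode: the bytes go to disk untouched, no decoding involved.
    with open(path, "wb") as f:
        f.write(payload)

    with open(path, "rb") as f:
        stored = f.read()

print(stored == payload)  # True


def undo_double_encoding(text):
    """Reverse a Latin-1 decode followed by a UTF-8 encode (hypothetical helper)."""
    return text.encode("latin-1").decode("utf-8")


# Reproduce the mangled text as it looks after reading the broken file as UTF-8.
mangled = payload.decode("latin-1").encode("utf-8").decode("utf-8")
repaired = undo_double_encoding(mangled)

print(repaired)  # DAÑO
```

Treat the repair as a last resort for files that are already broken. If the original download is still available, fetching it again and saving the bytes unchanged is the safer route.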

Upvotes: 1

Ferran

Reputation: 840

I've found the unidecode package to work quite well converting non-ascii characters to the closest ascii.

from unidecode import unidecode
def check_ascii(in_string):
    if in_string.isascii():  # Available in python 3.7+
        return in_string
    else:
        return unidecode(in_string)  # Converts non-ascii characters to the closest ascii

Then, if you believe some text might contain non-ASCII characters, you can pass it to the above function. In your case, after extracting the text between the HTML tags, you can pass it with:

for para in root.iter("p"):
    print(check_ascii(para.text))

You can find details about the package here: https://pypi.org/project/Unidecode/

Upvotes: 0
