Reputation: 41
I'm using lxml to extract text from html docs and I cannot get some characters from the text to render properly. It's probably a stupid thing, but I can't seem to figure out a solution...
Here's a simplified version of the html:
<html>
<head>
<meta content="text/html" charset="UTF-8"/>
</head>
<body>
<p>DAÑA – bis’e</p> <!-- that's an en dash and the single quote is curly -->
</body>
</html>
A simplified version of the code:
import lxml.html as LH
htmlfile = "path/to/file"
tree = LH.parse(htmlfile)
root = tree.getroot()
for para in root.iter("p"):
    print(para.text)
The output in my terminal shows little boxes with a character code inside (for example, one that should be "– E"), but if I copy-paste from there to here, it looks like:
>>> DAÃO bisâe
If I do a simple echo with the problem characters in the terminal, they render properly, so I don't think the terminal is the problem.
The HTML encoding is UTF-8 (checked with docinfo). I've tried .encode() and .decode() in various places in the code. I also tried lxml.etree.tostring() with a utf-8 declaration, but then .iter() doesn't work ('bytes' object has no attribute 'iter'), or, if I put it towards the end of the code, .text doesn't work ('bytes' object has no attribute 'text').
Any ideas what's going wrong and/or how to solve it?
Upvotes: 0
Views: 2485
Reputation: 338426
Open the file with the correct encoding (I'm assuming UTF-8 here, look at the HTML file to confirm).
import lxml.html as LH
with open("path/to/file", encoding="utf8") as f:
    tree = LH.parse(f)

root = tree.getroot()
for para in root.iter("p"):
    print(para.text)
Here's a background explanation of how you arrived where you currently are.
Incoming data from the server:
Bytes (hex)     Decoded as   Result String   Comment
44 41 C3 91 4F  UTF-8        DAÑO            proper decode
44 41 C3 91 4F  Latin-1      DAÃ▯O           improper decode
The bytes should not have been decoded as Latin-1, that's an error.
C3 91 represents one character in UTF-8 (the Ñ), but it's two characters in Latin-1: the Ã, and byte 91. Byte 91 has no printable character in Latin-1, so there is nothing to display. I use ▯ above to make it visible. A text editor might skip it altogether, showing DAÃO instead, or render a weird box or an error marker.
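You can reproduce the difference directly in Python; the byte values below are the ones from the table above:

```python
# The five bytes the server actually sent (hex 44 41 C3 91 4F)
raw = b"\x44\x41\xC3\x91\x4F"

# Decoded as UTF-8, the pair C3 91 is a single code point: Ñ
print(raw.decode("utf-8"))         # DAÑO

# Decoded as Latin-1, every byte becomes its own character;
# 0x91 maps to the invisible control character U+0091
print(repr(raw.decode("latin-1"))) # 'DAÃ\x91O'
```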
When writing the improperly decoded string to file:
String   Encoded as   Result Bytes (hex)     Comment
DAÃ▯O    UTF-8        44 41 C3 83 C2 91 4F   weird box preserved as C2 91
The string should not have been encoded as UTF-8 at this point, that's an error, too.
The Ã got converted to C3 83, which is correct for this character in UTF-8. Note how the byte sequence now matches what you told me in the comments (\xc3\x83\xc2\x91).
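This second mistake can be reproduced as well; the string literal below is the improperly decoded result from the first table:

```python
# The improperly decoded string: Ã (U+00C3) plus the control character U+0091
mangled = "DA\u00c3\u0091O"

# Re-encoding as UTF-8 doubles the damage:
# Ã becomes C3 83, and U+0091 becomes C2 91
print(mangled.encode("utf-8"))  # b'DA\xc3\x83\xc2\x91O'
```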
When reading that file:
Bytes (hex)           Decoded as   Result String   Comment
44 41 C3 83 C2 91 4F  UTF-8        DAÃ▯O           unprintable character is retained
44 41 C3 83 C2 91 4F  Latin-1      DAÃÂ▯O          unprintable character is retained
No matter how you decode that, it remains broken.
Your data got mangled by making two mistakes in a row: decoding it improperly, and then re-encoding it improperly again. The right thing would have been to write the bytes from the server directly to disk, without converting them to string at any point.
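As a side note: when a file has already been mangled in exactly this Latin-1/UTF-8 way, the damage can sometimes be undone by reversing the two mistakes in order, i.e. decode as UTF-8, encode back to Latin-1, then decode as UTF-8 again. This is a sketch that only works for this specific kind of mojibake, and it fails if the data was corrupted any further:

```python
# Bytes as found in the doubly-mangled file
mangled_bytes = b"DA\xc3\x83\xc2\x91O"

# Undo the wrong UTF-8 encode, then undo the wrong Latin-1 decode
repaired = mangled_bytes.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired)  # DAÑO
```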
Upvotes: 1
Reputation: 840
I've found the unidecode package to work quite well for converting non-ASCII characters to the closest ASCII equivalents.
from unidecode import unidecode
def check_ascii(in_string):
    if in_string.isascii():  # str.isascii() is available in Python 3.7+
        return in_string
    else:
        return unidecode(in_string)  # Converts non-ASCII characters to the closest ASCII
Then, if you believe some text might contain non-ASCII characters, you can pass it to the above function. In your case, after extracting the text between the HTML tags, you can pass it with:
for para in root.iter("p"):
    print(check_ascii(para.text))
You can find details about the package here: https://pypi.org/project/Unidecode/
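If installing a third-party package isn't an option, a rough standard-library approximation (my own sketch, not part of Unidecode) uses Unicode decomposition. Unlike unidecode, it simply drops characters that have no decomposition, such as the en dash or curly quote, instead of substituting ASCII look-alikes:

```python
import unicodedata

def ascii_fold(text):
    # Decompose accented letters (Ñ -> N + combining tilde),
    # then drop anything that still isn't ASCII
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(ascii_fold("DAÑO"))  # DANO
```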
Upvotes: 0