Reputation: 253
My new project is to extract data from the Naxos Glossary of Musical Terms, a great resource whose text data I want to process and extract to a database to use on another, simpler website I'll create.
My only problem is awful XHTML formatting. The W3C XHTML validation raises 318 errors and 54 warnings. Even a HTML Tidier I found can't fix it all.
I'm using Python 3.67 and the page I'm parsing was ASP. I've tested LXML and Python XML modules, but both fail.
Can anyone suggest any other tidiers or modules? Or will I have to use some sort of raw text manipulation (yuck!)?
My code:
LXML:
from lxml import etree
file = open("glossary.asp", "r", encoding="ISO-8859-1")
parsed = etree.parse(file)
Error:
Traceback (most recent call last):
File "/media/skuzzyneon/STORE-1/naxos_dict/xslt_test.py", line 4, in <module>
parsed = etree.parse(file)
File "src/lxml/etree.pyx", line 3426, in lxml.etree.parse
File "src/lxml/parser.pxi", line 1861, in lxml.etree._parseDocument
File "src/lxml/parser.pxi", line 1881, in lxml.etree._parseFilelikeDocument
File "src/lxml/parser.pxi", line 1776, in lxml.etree._parseDocFromFilelike
File "src/lxml/parser.pxi", line 1187, in lxml.etree._BaseParser._parseDocFromFilelike
File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
File "/media/skuzzyneon/STORE-1/naxos_dict/glossary.asp", line 25
lxml.etree.XMLSyntaxError: EntityRef: expecting ';', line 25, column 128
>>>
Python XML (using the tidied XHTML):
import xml.etree.ElementTree as ET
file = open("tidy.html", "r", encoding="ISO-8859-1")
root = ET.fromstring(file.read())
# Top-level elements
print(root.findall("."))
Error:
Traceback (most recent call last):
File "/media/skuzzyneon/STORE-1/naxos_dict/xslt_test.py", line 4, in <module>
root = ET.fromstring(file.read())
File "/usr/lib/python3.6/xml/etree/ElementTree.py", line 1314, in XML
parser.feed(text)
File "<string>", line None
xml.etree.ElementTree.ParseError: undefined entity: line 526, column 33
Upvotes: 0
Views: 889
Reputation: 55002
Lxml likely thinks you're giving it xml that way. Try it like this:
from lxml import html
from cssselect import GenericTranslator, SelectorError
file = open("glossary.asp", "r", encoding="ISO-8859-1")
doc = html.document_fromstring(file.read())
print(doc.cssselect('title')[0].text_content())
Also instead of "HTML Tidiers" just open it in chrome and copy the html in the elements panel.
Upvotes: 1