Parsing bad XHTML

Question

My new project is to extract data from the Naxos Glossary of Musical Terms, a great resource whose text data I want to process and extract to a database to use on another, simpler website I'll create.

My only problem is the awful XHTML formatting. The W3C XHTML validator reports 318 errors and 54 warnings. Even an HTML tidier I found can't fix it all.

I'm using Python 3.6.7, and the page I'm parsing is an ASP page. I've tried lxml and Python's built-in xml module, but both fail.

Can anyone suggest any other tidiers or modules? Or will I have to use some sort of raw text manipulation (yuck!)?

My code:

LXML:

from lxml import etree

file = open("glossary.asp", "r", encoding="ISO-8859-1")
parsed = etree.parse(file)

Error:

  Traceback (most recent call last):
  File "/media/skuzzyneon/STORE-1/naxos_dict/xslt_test.py", line 4, in <module>
    parsed = etree.parse(file)
  File "src/lxml/etree.pyx", line 3426, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1861, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1881, in lxml.etree._parseFilelikeDocument
  File "src/lxml/parser.pxi", line 1776, in lxml.etree._parseDocFromFilelike
  File "src/lxml/parser.pxi", line 1187, in lxml.etree._BaseParser._parseDocFromFilelike
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
  File "/media/skuzzyneon/STORE-1/naxos_dict/glossary.asp", line 25
lxml.etree.XMLSyntaxError: EntityRef: expecting ';', line 25, column 128
>>>

Python XML (using the tidied XHTML):

import xml.etree.ElementTree as ET

file = open("tidy.html", "r", encoding="ISO-8859-1")
root = ET.fromstring(file.read())

# Top-level elements
print(root.findall("."))

Error:

  Traceback (most recent call last):
  File "/media/skuzzyneon/STORE-1/naxos_dict/xslt_test.py", line 4, in <module>
    root = ET.fromstring(file.read())
  File "/usr/lib/python3.6/xml/etree/ElementTree.py", line 1314, in XML
    parser.feed(text)
  File "<string>", line None
xml.etree.ElementTree.ParseError: undefined entity: line 526, column 33

pguardiario · Accepted Answer

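lxml is probably treating the file as XML: etree.parse uses the strict XML parser by default, which rejects malformed markup such as the unterminated entity reference in your traceback. Try the lenient HTML parser in lxml.html instead: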

from lxml import html  # doc.cssselect() also needs the cssselect package installed

# Read the whole page and let the forgiving HTML parser build the tree
with open("glossary.asp", "r", encoding="ISO-8859-1") as file:
    doc = html.document_fromstring(file.read())

print(doc.cssselect('title')[0].text_content())

Also, instead of using an HTML tidier, you can just open the page in Chrome and copy the HTML from the Elements panel.
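
To see the difference between the two parsers without the real file, here is a minimal sketch. The sample string is made up for illustration (it is not taken from the Naxos page); it only mimics the kind of damage involved, with a bare "&" and unclosed tags. It uses XPath rather than cssselect so that lxml is the only dependency.

```python
from lxml import etree, html

# Hypothetical fragment, not from the real glossary page:
# it contains a bare "&" and tags that are never closed.
sample = (
    "<html><head><title>Glossary</title></head>"
    "<body><p>Allegro & Andante<br><b>fast</p></body></html>"
)

# The strict XML parser refuses the fragment.
try:
    etree.fromstring(sample)
    xml_ok = True
except etree.XMLSyntaxError as exc:
    xml_ok = False
    print("XML parser failed:", exc)

# The HTML parser recovers and still builds a usable tree.
doc = html.document_fromstring(sample)
title = doc.findtext(".//title")
paragraph = doc.xpath("//p")[0].text_content()

print("Title:", title)
print("Paragraph text:", paragraph)
```

If the HTML parser copes with a fragment like this, it should have a reasonable chance with the full page, though the tree it builds for badly nested markup may not match what you expect, so it is worth inspecting the result before writing selectors against it.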
