ignore malformed XML with Perl-XML

Question

I'm using the Perl command-line utility xpath to extract data from some HTML, as follows:

#!/bin/bash
# Quote the variable so the shell does not word-split or glob the HTML
echo "$HTML" | xpath -q -e "//h2[1]"

The HTML is malformed, which causes xpath to fail with the following error:

not well-formed (invalid token) at line X, column Y, byte Z:

I can't really fix the HTML, since it comes from an external source: every time the HTML changed, I would have to fix it manually again.

I looked at the xpath man page, which is pretty sparse: http://www.linuxcertif.com/man/1/xpath.1p/

Is there a way to tell xpath to ignore the malformed HTML? To give you an idea of how malformed it is, here are a few lines from the source code:

   <---- - instead of =

Thanks

dogbane · Accepted Answer

Try HTML::TreeBuilder::XPath, which uses an HTML parser to build a document tree that can then be queried with XPath expressions. Unlike a strict XML parser, an HTML parser should cope with markup that is not well-formed.

Also see this article on HTML Scraping with XPath.
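
As a rough sketch of how this could replace the xpath call in the script from the question: the helper below reads HTML on stdin, parses it with HTML::TreeBuilder::XPath, and prints the text of every node matching //h2[1]. It assumes the module is installed from CPAN. The function name first_h2 and the sample markup are made up for illustration, and it prints only the text content rather than the serialized node that xpath prints.

```bash
#!/bin/bash
# first_h2 is a hypothetical helper name, not part of any library.
# Assumes HTML::TreeBuilder::XPath is installed (e.g. via cpan or your package manager).
first_h2() {
  perl -MHTML::TreeBuilder::XPath -e '
    my $html = do { local $/; <STDIN> };       # slurp all of stdin
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($html);                       # lenient HTML parsing, not strict XML
    $tree->eof;                                # signal end of input
    print $_->as_text, "\n" for $tree->findnodes(q{//h2[1]});
    $tree->delete;                             # free the tree
  '
}

# Made-up sample: bare ampersand, unclosed <p>, missing </html>
HTML='<html><body><h2>First & foremost</h2><p>unclosed paragraph</body>'
echo "$HTML" | first_h2
```

If you need the markup of the matched node instead of its text, HTML::Element also provides as_HTML and as_XML, which can be called in place of as_text.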
