Max

Reputation: 13334

ignore malformed XML with Perl-XML

I'm using the Perl command-line utility xpath to extract data from some HTML as follows:

#!/bin/bash
# Quote the variable so the shell does not word-split or glob the HTML
echo "$HTML" | xpath -q -e "//h2[1]"

The HTML is malformed, which causes xpath to throw the following error:

not well-formed (invalid token) at line X, column Y, byte Z:

I can't really fix the HTML, since it's provided by an external source: every time the HTML changed, I would have to fix it manually again.

I looked at the xpath man page, but it is pretty sparse: http://www.linuxcertif.com/man/1/xpath.1p/

Is there a way to tell xpath to ignore the malformed HTML? To give you an idea of how malformed it is, here are a few lines from the source:

<div id="header-background" style="top: 42px; >&nbsp;</div> <---- missing closing "
<div id-"page-inner">   <---- - instead of =

Thanks

Upvotes: 0

Views: 1678

Answers (2)

mirod

Reputation: 16171

xml_grep, a command-line tool that comes with XML::Twig, can be used to extract data from HTML using XPath. Normally it works on XML, but you can use the -html option to process HTML (under the hood it uses HTML::TreeBuilder to convert the HTML to XML).

For example:

> xml_grep -html -t 'a[@class="genu"]' http://stackoverflow.com
> Stack Exchange

Upvotes: 4

dogbane

Reputation: 274828

Try HTML::TreeBuilder::XPath, which uses an HTML parser to build a document tree that can then be queried with XPath expressions. An HTML parser is far more tolerant than an XML parser, so it should cope with malformed markup.
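
As a rough sketch of how that might look (not a drop-in replacement for the xpath utility), the script below parses a broken snippet and prints the text of the first matching node. The helper name first_match_text and the sample markup are made up for illustration; it assumes HTML::TreeBuilder::XPath is installed from CPAN.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: return the text of the first node matching $xpath,
# or undef if nothing matches.
sub first_match_text {
    my ($html, $xpath) = @_;

    # The HTML parser is lenient, so broken attributes and unclosed
    # tags do not abort parsing the way they do with an XML parser.
    my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

    my ($node) = $tree->findnodes($xpath);    # list context: take the first hit
    my $text = defined $node ? $node->as_text : undef;

    $tree->delete;    # free the tree explicitly
    return $text;
}

# Made-up sample with the kind of damage described in the question.
my $html = '<div id-"page-inner"><h2>First heading</h2><p>unclosed paragraph';

my $text = first_match_text($html, '//h2[1]');
print defined $text ? "$text\n" : "no match\n";
```

To feed it from a pipe as in the question, the sample string could be replaced with `do { local $/; <STDIN> }`, which slurps everything on standard input.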

Also see this article on HTML Scraping with XPath.

Upvotes: 5
