jamacoe

Reputation: 549

How to use XPath on strings containing HTML in PowerShell?

I want to extract values from an HTML document. In another program (ui.vision / Selenium) I can do this with XPath expressions. I have worked out a large set of working XPath expressions, and now I want to use them in PowerShell. I have the string $html, which contains everything from <html> to </html> inclusive.

As far as I can tell from my research, I need an XML object in order to use Select-Xml with XPath expressions.

To convert $html to XML, I tried casting:

[xml]$xml = $html

as well as

$xml = [xml]$html

and I tried converting:

$html = $html | ConvertTo-Xml

All of these failed. I suspect the HTML would need to be well-formed XML, which it is not, even though it is valid HTML and passes the W3C validator without warnings. It is minified, and most attribute values lack quotation marks.

So how can I get XPath to work on a string containing an HTML page? I am about to resort to regular expressions, but translating all the XPath expressions looks like a lot of work.

Upvotes: 1

Views: 721

Answers (1)

mklement0

Reputation: 440337

HTML documents (except the XHTML variant, which is rarely seen these days) are generally not well-formed XML and therefore cannot be parsed as such.
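To illustrate the point, here is a minimal sketch using two made-up sample strings (not the asker's actual page). The first is ordinary HTML with an unquoted attribute value and an unclosed <br>; the second is the same markup written XHTML-style. Only the second survives the [xml] cast and can then be queried with XPath.

```powershell
# Sample input (made up for illustration): valid HTML, but not well-formed XML
# because the attribute value is unquoted and <br> is never closed.
$html = '<html><body><p class=intro>Hello<br>world</p></body></html>'

$castFailed = $false
try {
    $null = [xml]$html
} catch {
    # The cast throws because the XML parser rejects the markup.
    $castFailed = $true
    Write-Host "Cast failed: $($_.Exception.Message)"
}

# The same content written as XHTML: quoted attribute, self-closed <br/>.
$xhtml = '<html><body><p class="intro">Hello<br/>world</p></body></html>'
$doc   = [xml]$xhtml

# XPath works once the input is well-formed XML.
$nodes = $doc.SelectNodes('//p[@class="intro"]')
Write-Host "Matched $($nodes.Count) node(s)"
```

For well-formed input, Select-Xml -Content $xhtml -XPath '//p' is an alternative to calling .SelectNodes() on the document.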

The HTML Agility Pack is a third-party HTML parsing library whose API resembles the standard [xml] (System.Xml.XmlDocument) API, so it supports XPath via methods such as .SelectNodes(). A PowerShell wrapper for it exists in the form of the PSParseHTML module - see this answer for an example of its use.
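As a rough sketch of what this could look like: the sample string and the XPath expression below are made up, and the example assumes that the PSParseHTML module is installed and that its ConvertFrom-Html command accepts -Content and -Engine AgilityPack and returns an HTML Agility Pack node. Check the module's own help (Get-Help ConvertFrom-Html) before relying on these details.

```powershell
# Assumption: the module has been installed beforehand, e.g. with
#   Install-Module PSParseHTML -Scope CurrentUser
Import-Module PSParseHTML

# Made-up sample: minified HTML with unquoted attribute values.
$html = '<html><body><div id=prices><p class=price>12.50</p><p class=price>7.99</p></div></body></html>'

# Assumption: this returns an HTML Agility Pack node that exposes .SelectNodes().
$dom = ConvertFrom-Html -Content $html -Engine AgilityPack

# Reuse an existing XPath expression as-is.
$priceNodes = $dom.SelectNodes('//div[@id="prices"]/p[@class="price"]')

# HTML Agility Pack returns $null rather than an empty list when nothing matches.
if ($null -ne $priceNodes) {
    $prices = $priceNodes | ForEach-Object { $_.InnerText.Trim() }
    $prices
}
```

XPath expressions developed against a browser DOM may need small adjustments, since a standalone parser sees the raw markup rather than the tree the browser builds after normalization and script execution.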

Upvotes: 1
