Reputation: 549
I want to extract values from an HTML document. In another program (ui.vision / Selenium) I can do this with XPath expressions. I have worked out a whole set of working XPaths, and now I want to use them in PowerShell. I have the string $html containing everything from <html> to </html> (inclusive).
As far as I have researched, I need to have an xml object to use 'Select-Xml' with xpath statements.
In order to convert $html to xml I tried to cast:
[xml]$xml = $html
as well as
$xml = [xml]$html
and I tried to convert:
$html = $html | ConvertTo-xml
All failed. I think the HTML would need to be strictly well-formed for this to work, but it is not (even though it is valid HTML and passes the W3C validator without warnings). It is minified, and most attribute values are unquoted.
So how can I get XPath to work on a string containing an HTML page? I am about to resort to regular expressions, but translating all the XPath expressions seems like a lot of work.
Upvotes: 1
Views: 721
Reputation: 440337
HTML documents (except the XHTML variant, which is rarely seen these days) are generally not well-formed XML and therefore cannot be parsed as such.
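To illustrate the point, here is a minimal sketch using only built-in PowerShell features. The two sample strings are made up for this example (they are not from the question): one mimics minified HTML with unquoted attribute values and an unclosed <br>, the other is the same markup written as well-formed XHTML.

```powershell
# Hypothetical sample inputs, invented for illustration.
$minifiedHtml = '<html><body><a href=/home class=nav>Home</a><br></body></html>'
$xhtml        = '<html><body><a href="/home" class="nav">Home</a><br/></body></html>'

# The [xml] cast throws on markup that is not well-formed XML
# (here: unquoted attribute values and an unclosed <br>).
$castFailed = $false
try   { $null = [xml] $minifiedHtml }
catch { $castFailed = $true }

# Well-formed markup parses, and XPath then works as usual,
# both via the XmlDocument methods and via Select-Xml.
$doc  = [xml] $xhtml
$href = $doc.SelectSingleNode('//a[@class="nav"]/@href').Value
$text = (Select-Xml -Xml $doc -XPath '//a').Node.InnerText
```

So the casts in the question fail because of the input, not because of how they are written; the XPath expressions themselves are fine once the markup has been parsed by something that understands HTML.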
The HTML Agility Pack is a third-party HTML parsing library whose API is similar to the standard [xml] (System.Xml.XmlDocument) API and therefore includes XPath support via methods such as .SelectNodes(). A PowerShell wrapper for it exists, the PSParseHTML module; see this answer for an example of its use.
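A rough sketch of what that can look like, under several assumptions: that the PSParseHTML module can be installed from the PowerShell Gallery, that its ConvertFrom-Html command accepts -Content and -Engine AgilityPack as I recall from the module's documentation, and that the result exposes the HTML Agility Pack node API. The sample $html string is made up for the example; please verify the parameter names against the version of the module you install.

```powershell
# Assumption: PSParseHTML is available from the PowerShell Gallery.
if (-not (Get-Module -ListAvailable -Name PSParseHTML)) {
    Install-Module -Name PSParseHTML -Scope CurrentUser -ErrorAction Stop
}
Import-Module PSParseHTML

# Hypothetical sample input: minified HTML with unquoted attribute values.
$html = '<html><body><a href=/home class=nav>Home</a><br></body></html>'

# Assumption: -Content takes the HTML string and -Engine AgilityPack
# selects the HTML Agility Pack parser (the module also ships AngleSharp).
$dom = ConvertFrom-Html -Content $html -Engine AgilityPack

# .SelectNodes() takes an XPath expression; it returns $null when nothing matches.
$links = $dom.SelectNodes('//a[@class="nav"]')
$result = foreach ($node in $links) {
    [pscustomobject]@{
        Text = $node.InnerText
        Href = $node.GetAttributeValue('href', '')
    }
}
$result
```

Existing XPath expressions should carry over largely unchanged, though it is worth testing each one, because the HTML Agility Pack builds its own tree from the markup and that tree can differ from a browser's DOM for sloppy input.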
Upvotes: 1