jumbo

Reputation: 4878

HTMLDocumentClass and getElementsByClassName not working

Last year I had a PowerShell (v3) script that parsed the HTML of a festival page (and generated XML for my Windows Phone app).

I also asked a question about it here, and the answer worked like a charm.

But when I run the script this year, it does not work. To be specific, the method getElementsByClassName does not return anything. I also tried that method on other web pages, with no luck.

Here is my code from last year, which no longer works:

$tmpFile_bandInfo = "C:\band.txt"
Write-Host "Downloading band $($kap.Nazev) ..." -NoNewline
Invoke-WebRequest http://www.colours.cz/ucinkujici/the-asteroids-galaxy-tour/ -OutFile $tmpFile_bandInfo
$content = gc $tmpFile_bandInfo -Encoding utf8 -raw
$ParsedHtml = New-Object -com "HTMLFILE"
$ParsedHtml.IHTMLDocument2_write($content)
$ParsedHtml.Close()
$bodyK = $ParsedHtml.body
$page = $bodyK.getElementsByClassName("body four column page") # this returns NULL
$page = $page.item(0)
$aside = $page.getElementsByTagName("aside").item(0)
$img = $aside.getElementsByTagName("img").item(0)
$imgPath = $img.src

This is the code I used to work around it:

$sec = $bodyK.getElementsByTagName("section") | ? ClassName -eq "body four column page"
# but now I have no innerHTML, only the lonely tag SECTION
# so I am walking through siblings
$img = $sec.nextSibling.nextSibling.nextSibling.getElementsByTagName("img").item(0)
$imgPath = $img.src

This works, but it seems like a silly solution to me.
Does anyone know what I am doing wrong?

Upvotes: 1

Views: 4167

Answers (2)

danekan

Reputation: 482

This is not a bug. The value that looks like NULL is actually a reference to a proxy for the HTMLFile COM call into the DOM.

You can force it to evaluate and return the underlying strings by wrapping it in an array with @(), like this:

@($mybody.getElementsByClassName("body four column page")).textContent

The same thing happens automatically if you pipe it to Select-Object: the result is unwrapped via COM and returned as a string.

$mybody.getElementsByClassName("body four column page") | Select-Object -Property TextContent

Upvotes: 0

jumbo

Reputation: 4878

I actually solved this problem by abandoning the Invoke-WebRequest cmdlet and adopting HtmlAgilityPack.

I transformed my former sequential HTML parsing into a few XPath queries (everything stayed in the PowerShell script). This solution is much more elegant, and HtmlAgilityPack is a real badass ;) It is a real honour to work with a project like this!
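
For anyone wanting a starting point, here is a minimal sketch of what such an XPath query might look like. It is not the exact script from this answer. The DLL path is a placeholder, and the XPath is an assumption based on the markup described in the question (a section with class "body four column page" containing an aside with an img), so it may need adjusting for the real page.

```powershell
# Sketch only: the DLL path is a placeholder, not a path from the original post.
Add-Type -Path 'C:\libs\HtmlAgilityPack.dll'

$url = 'http://www.colours.cz/ucinkujici/the-asteroids-galaxy-tour/'

# HtmlWeb downloads and parses the page in one step.
$web = New-Object HtmlAgilityPack.HtmlWeb
$doc = $web.Load($url)

# Assumed XPath: first <img> inside an <aside> within the section from the question.
$xpath = "//section[@class='body four column page']//aside//img"
$img = $doc.DocumentNode.SelectSingleNode($xpath)

if ($null -ne $img) {
    # The second argument is the default returned when the attribute is missing.
    $imgPath = $img.GetAttributeValue('src', '')
    Write-Host "Image path: $imgPath"
} else {
    Write-Host 'No matching <img> found; the XPath probably needs adjusting.'
}
```

Because HtmlAgilityPack parses the markup itself rather than going through the HTMLFILE COM object, the section element keeps its children, so there should be no need to walk through siblings.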

Upvotes: 2
