Reputation: 4878
Last year I had powershell (v3) script that parsed HTML of one festival page (and generate XML for my Windows Phone app).
I also was asking a question about it here and it worked like a charm.
But when I run the script this year, it is not working. To be specific - the method getElemntsByClassName is not returning anything. I tried that method also on other web pages with no luck.
Here is my code from last year, that is not working now:
$tmpFile_bandInfo = "C:\band.txt"
Write-Host "Stahuji kapelu $($kap.Nazev) ..." -NoNewline
Invoke-WebRequest http://www.colours.cz/ucinkujici/the-asteroids-galaxy-tour/ -OutFile $tmpFile_bandInfo
$content = gc $tmpFile_bandInfo -Encoding utf8 -raw
$ParsedHtml = New-Object -com "HTMLFILE"
$ParsedHtml.IHTMLDocument2_write($content)
$ParsedHtml.Close()
$bodyK = $ParsedHtml.body
$bodyK.getElementsByClassName("body four column page") # this returns NULL
$page = $page.item(0)
$aside = $page.getElementsByTagName("aside").item(0)
$img = $aside.getElementsByTagName("img").item(0)
$imgPath = $img.src
this is code I used to workaround this:
$sec = $bodyK.getElementsByTagName("section") | ? ClassName -eq "body four column page"
# but now I have no innerHTML, only the lonely tag SECTION
# so I am walking through siblings
$img = $sec.nextSibling.nextSibling.nextSibling.getElementsByTagName("img").item(0)
$imgPath = $img.src
This works, but this seems silly solution to me.
Anyone knows what I am doing wrong?
Upvotes: 1
Views: 4167
Reputation: 482
The issue is not a bug but rather that the return where you're seeing NULL is because it's actually a reference to a proxy HTMLFile COM call to the DOM model.
You can force this to operate and return the underlying strings by boxing it into an array @() as such:
@($mybody.getElementsByClassName("body four column page")).textContent
If you do a Select-Object on it, that also automatically happens and it will unravel it via COM and return it as a string
$mybody.getElementsByClassName("body four column page") | Select-Object -Property TextContent
Upvotes: 0
Reputation: 4878
I actually solved this problem by abandoning Invoke-WebRequest
cmdlet and by adopting HtmlAgilityPack.
I transformed my former sequential HTML parsing into few XPath queries (everything stayed in powershell script). This solution is much more elegant and HtmlAgilityPack is real badass ;) It is really honour to work with project like this!
Upvotes: 2