W. Gielis

Reputation: 59

PowerShell inspect a HTML webpage

I am trying to use PowerShell to get the source of the images of a web page (easy), but only within a certain div element. I tried:

$ie = New-Object -com InternetExplorer.Application
$ie.visible = $false

$ie.navigate('https://www.lachainemeteo.com/meteo-belgique/ville-14875/previsions-meteo-tournai-demain')
While ($ie.Busy -eq $true){Start-Sleep -seconds 1;}

Foreach($q in $ie.document.body.getElementsByClassName("quarter").GetElementsByElementName("img"))
{
    Write-Output $q.src
}

But this gives an error in the ISE:

Method invocation failed because [System.__ComObject] does not contain a method named 'GetElementsByElementName'.

The lookup by the quarter class works: it returns the div elements, and I am able to get the innerText of each of the 5 quarter divs on the page. The difficulty is in grabbing the images within each quarter div.

Here is an image of what the HTML looks like: http://www.wimgielis.com/a.png

Can anyone point out my error, please? Thanks!

Upvotes: 1

Views: 2517

Answers (1)

postanote

Reputation: 16106

That error is pretty specific: the object you are calling has no method named 'GetElementsByElementName'. You can search for the error text and get the details.
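For the code in the question, the likely cause is twofold. The HTML DOM has no GetElementsByElementName method (the standard name is getElementsByTagName), and getElementsByClassName returns a collection, so the image lookup has to be made on each div rather than on the collection as a whole. A minimal sketch, assuming the IE COM document exposes both methods on its elements, as the question suggests it does for getElementsByClassName; the function name Get-ImageSourceByClass is hypothetical:

```powershell
# Hypothetical helper: returns the src of every <img> found inside
# elements that carry the given class name.
function Get-ImageSourceByClass {
    param(
        [Parameter(Mandatory)] $Document,
        [Parameter(Mandatory)] [string] $ClassName
    )

    # getElementsByClassName returns a collection, so visit each element in turn.
    foreach ($container in $Document.body.getElementsByClassName($ClassName)) {
        # getElementsByTagName is the DOM method for looking up descendant tags.
        foreach ($img in $container.getElementsByTagName('img')) {
            $img.src
        }
    }
}

# Usage with the $ie object from the question (assumes Windows with Internet Explorer):
# Get-ImageSourceByClass -Document $ie.Document -ClassName 'quarter'
```

The nested loop is the point of the sketch: the outer loop walks the quarter divs, and the inner loop walks the images inside one div.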

That being said, walking a web page is a very common thing to do with PowerShell, and there are many examples on the topic on StackOverflow and other sites.

https://stackoverflow.com/search?q=%5Bpowershell%5D+%27parse+webpage%27

When you walk the page, you have to ask specifically for the objects it contains. Also, if you are just after the source, there is no reason to open IE or any other browser. That is what the web cmdlets...

Invoke-WebRequest
#Gets content from a web page on the Internet.

... are for.
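As a rough illustration of the browserless route, the response object exposes an Images collection whose entries carry the attributes of each img tag. A minimal sketch, assuming the page serves its images in static HTML rather than injecting them with JavaScript; the function name Get-PageImageSource is hypothetical:

```powershell
# Hypothetical helper: returns the src attribute of every image on a page.
function Get-PageImageSource {
    param([Parameter(Mandatory)] [string] $Uri)

    # -UseBasicParsing avoids the Internet Explorer engine in Windows PowerShell;
    # newer PowerShell versions always parse this way.
    $response = Invoke-WebRequest -Uri $Uri -UseBasicParsing

    # Each entry in Images exposes the tag's attributes, including src.
    foreach ($image in $response.Images) {
        $image.src
    }
}

# Usage (requires network access):
# Get-PageImageSource -Uri 'https://en.wikipedia.org/wiki/PowerShell'
```

This returns every image on the page; it does not yet restrict the result to a particular div.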

Here are the kinds of things you could leverage without a browser or COM instance to walk the page to see what is really accessible, before making further attempts to interact with it:

   ### How to scrape a web page with PowerShell

   $w = Invoke-WebRequest -Uri 'https://www.reddit.com/r/PowerShell'

   # TypeName
   $w | Get-Member

   <#
      TypeName: Microsoft.PowerShell.Commands.HtmlWebResponseObject

   Name              MemberType   Definition                                                                 
   ----              ----------   ----------                                                                 
   Dispose           Method       void Dispose(), void IDisposable.Dispose()                                 
   Equals            Method       bool Equals(System.Object obj)                                             
   GetHashCode       Method       int GetHashCode()                                                          
   GetType           Method       type GetType()                                                             
   ToString          Method       string ToString()                                                          
   AllElements       Property     Microsoft.PowerShell.Commands.WebCmdletElementCollection AllElements {get;}
   BaseResponse      Property     System.Net.WebResponse BaseResponse {get;set;}                             
   Content           Property     string Content {get;}                                                      
   Forms             Property     Microsoft.PowerShell.Commands.FormObjectCollection Forms {get;}            
   Headers           Property     System.Collections.Generic.Dictionary[string,string] Headers {get;}        
   Images            Property     Microsoft.PowerShell.Commands.WebCmdletElementCollection Images {get;}     
   InputFields       Property     Microsoft.PowerShell.Commands.WebCmdletElementCollection InputFields {get;}
   Links             Property     Microsoft.PowerShell.Commands.WebCmdletElementCollection Links {get;}      
   ParsedHtml        Property     mshtml.IHTMLDocument2 ParsedHtml {get;}                                    
   RawContent        Property     string RawContent {get;set;}                                               
   RawContentLength  Property     long RawContentLength {get;}                                               
   RawContentStream  Property     System.IO.MemoryStream RawContentStream {get;}                             
   Scripts           Property     Microsoft.PowerShell.Commands.WebCmdletElementCollection Scripts {get;}    
   StatusCode        Property     int StatusCode {get;}                                                      
   StatusDescription Property     string StatusDescription {get;}                                            
   MSDN              ScriptMethod System.Object MSDN();    
   #>


$w.StatusCode

$w.AllElements
$w.AllElements.Count
$w.Links.Count
$w.Links

$w.Forms
$w.Forms[0].Fields

$w.RawContent

$w.ParsedHtml

$w = Invoke-WebRequest -Uri 'https://en.wikipedia.org/wiki/PowerShell'
$w.AllElements.Count
$w.Links.Count
$w.AllElements | 
Where-Object -Property 'TagName' -EQ 'P' | 
Select-Object -Property 'InnerText'

$w = Invoke-WebRequest -Uri 'https://www.reddit.com/r/aww'
$w.Links

$w = Invoke-WebRequest -Uri 'https://www.reddit.com/r/PowerShell'
$w.AllElements | 
Where-Object -Property 'TagName' -EQ 'H2' | 
Select-Object -Property 'InnerText'

$w = Invoke-WebRequest -Uri 'https://darksky.net/forecast/41.8756,-87.6244/us12/en'
$w.AllElements | 
Where-Object Class -EQ 'summary swap' | 
Select-Object -Property 'OuterText'
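Applied to the original question, the same AllElements filtering can narrow the page to the quarter divs, after which the img tags can be pulled from each div's innerHTML. A minimal sketch, assuming Windows PowerShell (where AllElements is available) and markup simple enough for a regular expression; both function names are hypothetical, and the regex is a shortcut chosen for brevity, not a full HTML parser:

```powershell
# Hypothetical helper: extracts the src values of <img> tags from an HTML fragment.
function Get-ImgSrcFromFragment {
    param([Parameter(Mandatory)] [AllowEmptyString()] [string] $Html)

    # Matches src="..." or src='...' inside an <img> tag; \s keeps data-src out.
    $pattern = '<img\b[^>]*?\ssrc\s*=\s*["'']([^"'']+)["'']'
    $options = [System.Text.RegularExpressions.RegexOptions]::IgnoreCase
    foreach ($match in [regex]::Matches($Html, $pattern, $options)) {
        $match.Groups[1].Value
    }
}

# Hypothetical helper: keeps elements whose class equals $ClassName exactly
# (the same exact-match filter as Where-Object Class -EQ) and lists their images.
function Get-ImgSrcByClass {
    param(
        [Parameter(Mandatory)] $Elements,
        [Parameter(Mandatory)] [string] $ClassName
    )

    $Elements |
        Where-Object { $_.class -eq $ClassName } |
        ForEach-Object { Get-ImgSrcFromFragment -Html ([string] $_.innerHTML) }
}

# Usage (Windows PowerShell, requires network access):
# $w = Invoke-WebRequest -Uri 'https://www.lachainemeteo.com/meteo-belgique/ville-14875/previsions-meteo-tournai-demain'
# Get-ImgSrcByClass -Elements $w.AllElements -ClassName 'quarter'
```

If the images are added by JavaScript after the page loads, they will not appear in the downloaded HTML and this approach will return nothing.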

Also, note that some websites deliberately block automation against them and will simply return errors when you try.
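One way to cope with such sites is to treat a failed request as an expected outcome rather than letting the script stop. A minimal sketch, assuming that returning nothing and printing a warning is acceptable for the calling script; the function name Get-PageOrNull is hypothetical:

```powershell
# Hypothetical helper: returns the response, or $null if the request fails.
function Get-PageOrNull {
    param([Parameter(Mandatory)] [string] $Uri)

    try {
        # -ErrorAction Stop makes sure any failure is raised as a catchable exception.
        Invoke-WebRequest -Uri $Uri -UseBasicParsing -ErrorAction Stop
    }
    catch {
        Write-Warning "Request to $Uri failed: $($_.Exception.Message)"
        $null
    }
}

# Usage (requires network access):
# $w = Get-PageOrNull -Uri 'https://www.reddit.com/r/PowerShell'
# if ($null -ne $w) { $w.StatusCode }
```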

Upvotes: 1
