nikhileshwar y

Reputation: 51

How to extract data from HTML output using PowerShell

>$op.Content
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
<head><meta name="robots" content="noindex" />
<title>Index of generic-releases/upgrades/sw/release</title>
</head>
<body>
<h1>Index of generic-releases/upgrades/sw/release</h1>
<pre>Name       Last modified      Size</pre><hr/>
<pre><a href="../">../</a>
<a href="10.4.0.30/">10.4.0.30/</a>  14-jan-2020 15:08    -
<a href="10.4.0.34/">10.4.0.34/</a>  14-jan-2020 20:19    -
<a href="10.5.0.46/">10.5.0.46/</a>  27-jan-2020 18:43    -
</pre>
<hr/><address style="font-size:small;">Artifactory Online Server at xxx.jfrog.io Port 80</address> 
</body></html>

1) Above is the HTML page with the list of software version folders.
2) I want the output sorted by date and time, newest first, as shown below, using PowerShell:

10.5.0.46  27-jan-2020 18:43
10.4.0.34  14-jan-2020 20:19
10.4.0.30  14-jan-2020 15:08

3) Can anyone please help me achieve this output using PowerShell?

Upvotes: 0

Views: 2519

Answers (2)

iRon

Reputation: 23623

As @Doug mentioned, it is generally a bad idea to peek or poke directly in serialized data such as HTML text. It is better to use a proper HTML parser to retrieve your data.

To give you an example using the IHTMLDocument2 interface:

$Doc = New-Object -Com 'HTMLFile'
# $Content is assumed to hold the HTML as a single string, e.g. $Content = $op.Content
$Unicode = [System.Text.Encoding]::Unicode.GetBytes($Content)
if ($Doc.IHTMLDocument2_Write) { $Doc.IHTMLDocument2_Write($Unicode) } else { $Doc.write($Unicode) }

$Pres = $Doc.getElementsByTagName('pre')
$Headers = $Pres[0].innerText -Split '\s\s+'
$Pres[1].childNodes | Foreach-Object {
    if ($_.tagName -eq 'A') { $A = $_ }
    elseif ($_.nodeValue -is [string]) {
        $Data = ($A.innerText + $_.nodeValue) -Split '\s\s+'
        $Properties = [ordered]@{}
        $i = 0
        Foreach ($Header in $Headers) {
            # Only assign a value while there are fields left ('../' has no date or size)
            if ($i -lt $Data.Count) { $Properties[$Header] = $Data[$i++] }
        }
        [pscustomobject]$Properties
    }
} | Sort-Object { $_.'Last modified' -as [DateTime] } -Descending

Name       Last modified     Size
----       -------------     ----
10.5.0.46/ 27-jan-2020 18:43 -…
10.4.0.34/ 14-jan-2020 20:19 -…
10.4.0.30/ 14-jan-2020 15:08 -…
../…
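
The rows above still carry the trailing slash, stray whitespace, and the `../` entry. A minimal sketch of trimming them to the layout asked for in the question could look like the following. It assumes objects with `Name` and `Last modified` properties as in the output above; `$Rows` and `$Lines` are hypothetical names, and the sample data only stands in for the real pipeline result.

```powershell
# Hypothetical stand-in for the objects produced by the pipeline above.
$Rows = @(
    [pscustomobject]@{ Name = "../`n" }
    [pscustomobject]@{ Name = '10.4.0.30/'; 'Last modified' = '14-jan-2020 15:08'; Size = "-`n" }
    [pscustomobject]@{ Name = '10.4.0.34/'; 'Last modified' = '14-jan-2020 20:19'; Size = "-`n" }
    [pscustomobject]@{ Name = '10.5.0.46/'; 'Last modified' = '27-jan-2020 18:43'; Size = "-`n" }
)

$Lines = $Rows |
    # Drop entries without a date, such as '../'
    Where-Object { $_.'Last modified' } |
    # Parse with an explicit format so the sort does not depend on the machine's culture
    Sort-Object {
        [datetime]::ParseExact($_.'Last modified'.Trim(), 'dd-MMM-yyyy HH:mm',
            [cultureinfo]::InvariantCulture)
    } -Descending |
    # Strip whitespace and the trailing slash, then join name and date
    ForEach-Object { '{0}  {1}' -f $_.Name.Trim().TrimEnd('/'), $_.'Last modified'.Trim() }

$Lines
```

Using `ParseExact` with a fixed format is a design choice here; the `-as [DateTime]` cast in the answer is shorter and also handles this date format.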

Upvotes: 1

Doug Maurer

Reputation: 8868

It's generally a bad idea to try to parse HTML manually; HTML parsing libraries exist for this purpose. For your simple example, though, you could get your desired results with regex. The code below assumes that $op.Content is an array of strings, one per line. If it is actually a single string, a slight adjustment is needed.

$results = switch -Regex ($op.Content){
    '>(.+?)/.+?\s+(.+?)\s{2,}' {
        [PSCustomObject]@{
            IP   = $Matches.1
            Date = $matches.2 -as [datetime]
        }
    }
}

$results | Sort-Object -Property date -Descending

Output on my machine

IP        Date                 
--        ----                 
10.5.0.46 2020-01-27 6:43:00 PM
10.4.0.34 2020-01-14 8:19:00 PM
10.4.0.30 2020-01-14 3:08:00 PM
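
If `$op.Content` turns out to be a single multi-line string, the slight adjustment mentioned above could be to split it into lines before handing it to `switch`. The sketch below is one way to do that, not necessarily what the answer's author had in mind. `$content` and `$sorted` are hypothetical names, and the here-string is sample data standing in for `$op.Content`.

```powershell
# Sample listing as one multi-line string (assumption: same shape as $op.Content).
$content = @'
<pre><a href="../">../</a>
<a href="10.4.0.30/">10.4.0.30/</a>  14-jan-2020 15:08    -
<a href="10.4.0.34/">10.4.0.34/</a>  14-jan-2020 20:19    -
<a href="10.5.0.46/">10.5.0.46/</a>  27-jan-2020 18:43    -
</pre>
'@

# Split on line breaks so that switch -Regex tests each line separately.
$results = switch -Regex ($content -split '\r?\n') {
    '>(.+?)/.+?\s+(.+?)\s{2,}' {
        [PSCustomObject]@{
            IP   = $Matches.1
            Date = $Matches.2 -as [datetime]
        }
    }
}

$sorted = $results | Sort-Object -Property Date -Descending
$sorted
```

Without the split, `switch` would test the whole string once and the pattern would yield at most one match.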

Upvotes: 0
