Joe
Joe

Reputation: 83

Powershell - Parsing HTML Table From Invoke-WebRequest

I'm using an Invoke-WebRequest to grab user information from a table on a website. The information is coming back as such:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<table border='0' cellspacing='0' cellpadding='2' width='640'><tr><td class='col1bold' width='10'>&nbsp;</td><td class='col1reversebold' width='10'>&nbsp;</td><td class='col1reversebold' width='5'>&nbsp;</td><td class='col1reversebold' width='130'>F
ull Name</td><td class='col1reversebold' width='80'>Label</td><td class='col1reversebold' width='70'>Call Status</td><td class='col1reversebold' width='60'>DND/Hold</td><td class='col1reversebold' width='125'>Location</td><td class='col1reversebold'
 width='80'>Current Site</td></tr><tr><td class='col1bold' width='30'>&nbsp;</td><td class='col1'><img src=/console/images/static/B3000NW.gif></td><td class='col1'>&nbsp;</td><td class='col1'>Testing, Steve</td><td class='col1'>OR-72</td><td class='
col1'>Inactive</td><td class='col1'>&nbsp;</td><td class='col1'>703a0e88c252</td><td class='col1'>Global</td></tr><tr><td class='col1bold' width='30'>&nbsp;</td><td class='col1'><img src=/console/images/static/B3000B.gif></td><td class='col1'>&nbsp;
</td><td class='col1'>GreenTest, Connie</td><td class='col1'>OR-24</td><td class='col1'>Inactive</td><td class='col1'>&nbsp;</td><td class='col1'>Left Nurse Station</td><td class='col1'>Global</td></tr></table>

If I use the following syntax:

$GroupRequests = Invoke-WebRequest -Uri "$URI" -WebSession $session -Method GET

$HTMLTable = $GroupRequests.ParsedHtml.body.InnerHTML

I get:

<TABLE cellSpacing=0 cellPadding=2 width=640 border=0>
<TBODY>
<TR>
<TD class=col1bold width=10>&nbsp;</TD>
<TD class=col1reversebold width=10>&nbsp;</TD>
<TD class=col1reversebold width=5>&nbsp;</TD>
<TD class=col1reversebold width=130>Full Name</TD>
<TD class=col1reversebold width=80>Label</TD>
<TD class=col1reversebold width=70>Call Status</TD>
<TD class=col1reversebold width=60>DND/Hold</TD>
<TD class=col1reversebold width=125>Location</TD>
<TD class=col1reversebold width=80>Current Site</TD></TR>
<TR>
<TD class=col1bold width=30>&nbsp;</TD>
<TD class=col1><IMG src="/console/images/static/B3000NW.gif"></TD>
<TD class=col1>&nbsp;</TD>
<TD class=col1>Testing, Steve</TD>
<TD class=col1>OR-72</TD>
<TD class=col1>Inactive</TD>
<TD class=col1>&nbsp;</TD>
<TD class=col1>703a0e987f72</TD>
<TD class=col1>Global</TD></TR>
<TR>
<TD class=col1bold width=30>&nbsp;</TD>
<TD class=col1><IMG src="/console/images/static/B3000B.gif"></TD>
<TD class=col1>&nbsp;</TD>
<TD class=col1>GreenTest, Connie</TD>
<TD class=col1>OR-24</TD>
<TD class=col1>Inactive</TD>
<TD class=col1>&nbsp;</TD>
<TD class=col1>Left Nurse Station</TD>
<TD class=col1>Global</TD></TR></TBODY></TABLE>

How do I basically parse each row (<TR>) and then get the information into columns and headers? I've tried so many options with no luck.

Upvotes: 0

Views: 1509

Answers (1)

Theo
Theo

Reputation: 61218

A bit strange your ParsedHtml seems to have an XML declaration, but you could try the function below:

function ConvertFrom-HtmlTable {
    [CmdletBinding(DefaultParameterSetName = 'ByIndex')]
    param(
        [Parameter(Mandatory = $true, Position = 0)]
        [Microsoft.PowerShell.Commands.HtmlWebResponseObject]$WebRequest,

        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByIndex')]
        [int]$TableIndex = 0,

        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ById')]
        [string]$TableId,

        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByName')]
        [string]$TableName,

        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByClass')]
        [string]$TableClassName,

        # if there are no <th> tags, by appending this switch, the code
        # uses the first row in the table as column headers.
        [switch]$FirstRowHasHeaders
   )

    # Extract the table out of the web request
    switch ($PSCmdlet.ParameterSetName) {
        'ById'    { $table = $WebRequest.ParsedHtml.getElementByID($TableId) }
        'ByIndex' { $table = @($WebRequest.ParsedHtml.getElementsByTagName('table'))[$TableIndex]}
        'ByName'  { $table = @($WebRequest.ParsedHtml.getElementsByName($TableName))[0] }
        'ByClass' { $table = @($WebRequest.ParsedHtml.getElementsByClassName($TableClassName))[0] }
    }
    $WebRequest.Dispose()
    $WebRequest.Finalize()  # free the COM objects

    if (!$table) {
        Write-Warning "Could not find the given table."
        return $null
    }

    # load the System.Web assembly to be able to decode HTML entities
    Add-Type -AssemblyName System.Web

    $headers = @()
    # Go through all of the rows in the table
    foreach ($row in $table.Rows) {
        $cells = @($row.Cells)
        # If there is a table header or if we should consider the first row 
        # to be the header row, remember its titles
        if ($FirstRowHasHeaders -or $cells[0].tagName -eq "TH") {
            $i = 0
            $headers = @($cells | ForEach-Object {
                $i++
                # decode HTML entities and double-up quotes that the value may contain
                $th = ([System.Web.HttpUtility]::HtmlDecode($_.InnerText) -replace '"', '""').Trim()
                # if the table header is empty, create it
                if ([string]::IsNullOrEmpty($th)) { "H$i" } else { $th }
            })
            # we now have headers, set the switch to $false so the rest is parsed as data
            $FirstRowHasHeaders = $false
            # proceed with the next row
            continue
        }
        # if we haven't found any table headers, make up names "H1", "H2", etc.
        if(-not $headers) {
            $headers = @(1..($cells.Count + 2) | ForEach-Object { "H$_" })
        }

        # Now go through the cells in the the row. For each, try to find the
        # title that represents that column and create a hashtable mapping those
        # titles to content
        $hash = [Ordered]@{}
        for ($i = 0; $i -lt $cells.Count; $i++) {
            # decode HTML entities and double-up quotes that the value may contain
            $value = ([System.Web.HttpUtility]::HtmlDecode($cells[$i].InnerText) -replace '"', '""').Trim()
            $th = $headers[$i]
            $hash[$th] = $value.Trim()
        }
        # And finally cast that hashtable to a PSCustomObject
        [PSCustomObject]$hash
    }
}

and use it like this:

$GroupRequests = Invoke-WebRequest -Uri $URI -WebSession $session -Method GET

# the table doesn't have a name or id attribute, so use its index (guessing its the first table)
# in your example the table also does not have <th> header tags, so append -FirstRowHasHeaders
# if you want the first data row to act as column headers or accept that headers are created as
# 'H1', 'H2', 'H3' etc. Cells containing only whitespace will always be translated to 'H<columnNumber>'
$users = ConvertFrom-HtmlTable -WebRequest $GroupRequests -TableIndex 0 -FirstRowHasHeaders

# show on screen
$users | Format-Table -AutoSize

# save as CSV file
$users | Export-Csv -Path 'X:\Somewhere\users.csv' -NoTypeInformation

Upvotes: 1

Related Questions