Souf
Souf

Reputation: 389

PowerShell: Extracting HTML table as CSV

I am trying to extract a HTML table to a CSV file. I do not know a lot from PowerShell but online I've found some examples, but I always get the same error message:

You cannot call a method on a null-valued expression. At line:8 char:1

  • $table = $oHTML.ParsedHtml.body.getElementsByTagName('table')[0]
  • CategoryInfo : InvalidOperation: (:) [], RuntimeException
  • FullyQualifiedErrorId : InvokeMethodOnNull

This is what I have for the moment, but am a bit stuck.

$url = "https://winreleaseinfoprod.blob.core.windows.net/winreleaseinfoprod/en-US.html"
$webClient = New-Object System.Net.Webclient
$webClient.DownloadString($url) | Out-File -FilePath C:\Users\USER\Downloads\DUMP\dump.html

$oHTML = Get-Content C:\Users\USER\Downloads\DUMP\dump.html -Raw

#Just grabbing first table for my testing
$table = $oHTML.ParsedHtml.body.getElementsByTagName('Table')[0]

$Headers = ($table.Rows[0].Cells | Select -ExpandProperty innerText).trim()
$psCollection=@()

$dataRows = $table.Rows | Select -Skip 1
foreach ($tablerow in $dataRows) {
    $cells = ($tablerow.Cells | Select -ExpandProperty innerText).trim()
    $obj = New-Object -TypeName PSObject
    $count = 0;
    foreach ($cell in $cells) {
        if ($count -lt $Headers.length) {
            $obj | Add-Member -MemberType NoteProperty -Name $Headers[$count++] -Value $cell
        }
    }
    $psCollection+=$obj
}

$psCollection | Select 'MyField' -Unique

Upvotes: 2

Views: 3668

Answers (1)

iRon
iRon

Reputation: 23623

At the PowerShell Gallery you will find a Read-HtmlTable cmdlet based on the IHTMLDocument2 interface:

Installation

Install-Script -Name Read-HtmlTable

Usage:

$uri = "https://winreleaseinfoprod.blob.core.windows.net/winreleaseinfoprod/en-US.html"

Read-HtmlTable $Uri -Table 0 |Format-Table # or: |Export-Csv .\output.csv

Results:

Version Servicing option               Availability date OS build   Latest revision date End of service: Home, Pro, Pro Education, Pro for Workstations and IoT Core End of service: Enterprise, Education and IoT Enterprise
------- ----------------               ----------------- --------   -------------------- --------------------------------------------------------------------------- --------------------------------------------------------
20H2    Semi-Annual Channel            2020-10-20        19042.928  2021-04-13           2022-05-10                                                                  2023-05-09
2004    Semi-Annual Channel            2020-05-27        19041.928  2021-04-13           2021-12-14                                                                  2021-12-14
1909    Semi-Annual Channel            2019-11-12        18363.1500 2021-04-13           2021-05-11                                                                  2022-05-10
1809    Semi-Annual Channel            2019-03-28        17763.1879 2021-04-13           End of service                                                              2021-05-11
1809    Semi-Annual Channel (Targeted) 2018-11-13        17763.1879 2021-04-13           End of service                                                              2021-05-11
1803    Semi-Annual Channel            2018-07-10        17134.2145 2021-04-13           End of service                                                              2021-05-11
1803    Semi-Annual Channel (Targeted) 2018-04-30        17134.2145 2021-04-13           End of service                                                              2021-05-11

Parameters

-InputObject <string>
The html content (string) that contains a html table.

If the string is less than 2048 characters and contains a valid uri protocol, the content is downloaded from the concerned location.

-uri <Uri>
A uri location referring to the html content that contains the html table

-Header <string[]>
Specifies an alternate column header row for the imported string. The column header determines the property names of the objects created by ConvertFrom-Csv.

Enter column headers as a comma-separated list. Do not enclose the header string in quotation marks. Enclose each column header in single quotation marks.

If you enter fewer column headers than there are data columns, the remaining data columns are discarded. If you enter more column headers than there are data columns, the additional column headers are created with empty data columns.

A $Null instead of a column name, will span the respective column with previous column.

Note: To select specific columns or skip any data (or header) rows, use Select-Object cmdlet

-TableIndex <int[]>
Specifies which tables should be selected from the html content (where 0 refers to the first table). By default, all tables are extracted from the content.

Note: in case of multiple tables, the headers should be unified to properly output or display of each table.
(see: https://github.com/PowerShell/PowerShell/issues/13906)

-Separator <string>
Specifies the characters used to join a header with is spanned over multiple columns.
default: space character

-Delimiter <string>
Specifies the characters used to join a header with is spanned over multiple rows.
default: the newline characters used by the operating system

-NoTrim
By default, all header - and data text is trimmed, to disable trimming, use the -NoTrim parameter.

Upvotes: 3

Related Questions