Reputation: 389
I am trying to extract a HTML table to a CSV file. I do not know a lot from PowerShell but online I've found some examples, but I always get the same error message:
You cannot call a method on a null-valued expression. At line:8 char:1
- $table = $oHTML.ParsedHtml.body.getElementsByTagName('table')[0]
- CategoryInfo : InvalidOperation: (:) [], RuntimeException
- FullyQualifiedErrorId : InvokeMethodOnNull
This is what I have for the moment, but am a bit stuck.
$url = "https://winreleaseinfoprod.blob.core.windows.net/winreleaseinfoprod/en-US.html"
$webClient = New-Object System.Net.Webclient
$webClient.DownloadString($url) | Out-File -FilePath C:\Users\USER\Downloads\DUMP\dump.html
$oHTML = Get-Content C:\Users\USER\Downloads\DUMP\dump.html -Raw
#Just grabbing first table for my testing
$table = $oHTML.ParsedHtml.body.getElementsByTagName('Table')[0]
$Headers = ($table.Rows[0].Cells | Select -ExpandProperty innerText).trim()
$psCollection=@()
$dataRows = $table.Rows | Select -Skip 1
foreach ($tablerow in $dataRows) {
$cells = ($tablerow.Cells | Select -ExpandProperty innerText).trim()
$obj = New-Object -TypeName PSObject
$count = 0;
foreach ($cell in $cells) {
if ($count -lt $Headers.length) {
$obj | Add-Member -MemberType NoteProperty -Name $Headers[$count++] -Value $cell
}
}
$psCollection+=$obj
}
$psCollection | Select 'MyField' -Unique
Upvotes: 2
Views: 3668
Reputation: 23623
At the PowerShell Gallery you will find a Read-HtmlTable
cmdlet based on the IHTMLDocument2 interface:
Installation
Install-Script -Name Read-HtmlTable
Usage:
$uri = "https://winreleaseinfoprod.blob.core.windows.net/winreleaseinfoprod/en-US.html"
Read-HtmlTable $Uri -Table 0 |Format-Table # or: |Export-Csv .\output.csv
Results:
Version Servicing option Availability date OS build Latest revision date End of service: Home, Pro, Pro Education, Pro for Workstations and IoT Core End of service: Enterprise, Education and IoT Enterprise
------- ---------------- ----------------- -------- -------------------- --------------------------------------------------------------------------- --------------------------------------------------------
20H2 Semi-Annual Channel 2020-10-20 19042.928 2021-04-13 2022-05-10 2023-05-09
2004 Semi-Annual Channel 2020-05-27 19041.928 2021-04-13 2021-12-14 2021-12-14
1909 Semi-Annual Channel 2019-11-12 18363.1500 2021-04-13 2021-05-11 2022-05-10
1809 Semi-Annual Channel 2019-03-28 17763.1879 2021-04-13 End of service 2021-05-11
1809 Semi-Annual Channel (Targeted) 2018-11-13 17763.1879 2021-04-13 End of service 2021-05-11
1803 Semi-Annual Channel 2018-07-10 17134.2145 2021-04-13 End of service 2021-05-11
1803 Semi-Annual Channel (Targeted) 2018-04-30 17134.2145 2021-04-13 End of service 2021-05-11
Parameters
-InputObject <string>
The html content (string) that contains a html table.
If the string is less than 2048 characters and contains a valid uri protocol, the content is downloaded from the concerned location.
-uri <Uri>
A uri location referring to the html content that contains the html table
-Header <string[]>
Specifies an alternate column header row for the imported string. The column header determines the property
names of the objects created by ConvertFrom-Csv
.
Enter column headers as a comma-separated list. Do not enclose the header string in quotation marks. Enclose each column header in single quotation marks.
If you enter fewer column headers than there are data columns, the remaining data columns are discarded. If you enter more column headers than there are data columns, the additional column headers are created with empty data columns.
A $Null
instead of a column name, will span the respective column with previous column.
Note: To select specific columns or skip any data (or header) rows, use
Select-Object
cmdlet
-TableIndex <int[]>
Specifies which tables should be selected from the html content (where 0 refers to the first table).
By default, all tables are extracted from the content.
Note: in case of multiple tables, the headers should be unified to properly output or display of each table.
(see: https://github.com/PowerShell/PowerShell/issues/13906)
-Separator <string>
Specifies the characters used to join a header with is spanned over multiple columns.
default: space character
-Delimiter <string>
Specifies the characters used to join a header with is spanned over multiple rows.
default: the newline characters used by the operating system
-NoTrim
By default, all header - and data text is trimmed, to disable trimming, use the -NoTrim parameter.
Upvotes: 3