Reputation: 380
I want to extract this text:
Spectrum Mortis - Bit Meseri - The Incantation (2022)
Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)
from this HTML block:
<span id='tid-span-369523'><a id="tid-link-369523" href="" title="This topic was started: Sep 16 2022, 04:18:47">Spectrum Mortis - Bit Meseri - The Incantation (2022)</a></span>
<span id='tid-span-221568'><a id="tid-link-221568" href="" title="This topic was started: Apr 11 2014, 14:31:18">Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)</a></span>
I'm trying to use this code, but nothing is written to output2.txt:
$html = Get-Content -Path 'C:\temp\html\metalarea2.html' -Raw
$pattern = '<span id="tid-span-\\d+"><a id="tid-link-\\d+" href=".+?" title=".+?">(.+?)</a></span>'
$matches = Select-String -InputObject $html -Pattern $pattern -AllMatches
$result = $matches | % { $_.Matches } | % { $_.Groups[1].Value }
$result | Out-File -FilePath "C:\temp\html\output2.txt"
I don't understand where the problem lies.
$pattern = '<span id=\x27tid-span-\d+\x27><a id="tid-link-\d+" href=".+?" title=".+?">(.+?)</a></span>'
$pattern = '<a id="tid-link-\d+".+?>(.+?)</a>'
Upvotes: 0
Views: 1590
Reputation: 23623
It is generally a bad idea to peek and/or poke in structured text using regular expressions. Instead, it is better to use a proper (HTML) parser to manipulate your data.
To give you an example using the IHTMLDocument2 interface:
$Html = @'
<span id="tid-span-369523"><a id="tid-link-369523" href="" title="This topic was started: Sep 16 2022, 04:18:47">Spectrum Mortis - Bit Meseri - The Incantation (2022)</a></span>
<span id='tid-span-221568'><a id="tid-link-221568" href="" title="This topic was started: Apr 11 2014, 14:31:18">Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)</a></span>
<div id="something">Text within div</div>
'@

function ParseHtml($String) {
    # Encode the markup as UTF-16 (Unicode) bytes before handing it to the COM object
    $Unicode = [System.Text.Encoding]::Unicode.GetBytes($String)
    $Html = New-Object -Com 'HTMLFile'
    # The write method is exposed under different names depending on the environment
    if ($Html.PSObject.Methods.Name -Contains 'IHTMLDocument2_Write') {
        $Html.IHTMLDocument2_Write($Unicode)
    }
    else {
        $Html.write($Unicode)
    }
    $Html.Close()
    $Html
}

$Document = ParseHtml $Html
# Keep only the anchors whose id starts with 'tid-link-' and print their text
$Document.getElementsByTagName('a') |
    Where-Object { $_.id -Like 'tid-link-*' } |
    Foreach-Object { $_.innerText }
Spectrum Mortis - Bit Meseri - The Incantation (2022)
Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)
Upvotes: 2
Reputation: 961
You can use the regular expression below to capture the plain text between HTML tags:
(<[^>]*>)+(?<plaintext>[^<]+)<\/[^>]*>
You can refer to this example in the live sample.
Here is a full script example:
$html = @"
<span id="tid-span-369523"><a id="tid-link-369523" href="" title="This topic was started: Sep 16 2022, 04:18:47">Spectrum Mortis - Bit Meseri - The Incantation (2022)</a></span>
<span id='tid-span-221568'><a id="tid-link-221568" href="" title="This topic was started: Apr 11 2014, 14:31:18">Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)</a></span>
<div id="something">Text within div</div>
"@
$pattern = '(<[^>]*>)+(?<plaintext>[^<]+)<\/[^>]*>'
$options = [System.Text.RegularExpressions.RegexOptions]::Multiline
# Use a name other than $matches, which is an automatic variable in PowerShell
$found = [regex]::Matches($html, $pattern, $options)
$results = $found | %{ $_.Groups["plaintext"].Value }
# Print the captured text; this also includes "Text within div"
$results
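If only the topic links are wanted, a narrower pattern may be enough. The sketch below is an illustration, not part of the script above: the variable names `$sample`, `$linkPattern` and `$titles` are made up, and the markup is copied from the question. It also avoids three things that stop the question's pattern from matching: `\\d` inside a single-quoted string asks the regex engine for a literal backslash, the span `id` is matched with double quotes while the markup uses single quotes, and `.+?` cannot match the empty `href=""`.

```powershell
# Sketch only: sample markup copied from the question; names are illustrative.
$sample = @'
<span id='tid-span-369523'><a id="tid-link-369523" href="" title="This topic was started: Sep 16 2022, 04:18:47">Spectrum Mortis - Bit Meseri - The Incantation (2022)</a></span>
<span id='tid-span-221568'><a id="tid-link-221568" href="" title="This topic was started: Apr 11 2014, 14:31:18">Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)</a></span>
<div id="something">Text within div</div>
'@

# Single backslash: single-quoted PowerShell strings do not process escapes,
# so '\d' reaches the regex engine as the digit class.
# [^>]* skips the remaining attributes, whether or not they are empty.
$linkPattern = '<a id="tid-link-\d+"[^>]*>(.+?)</a>'

# Collect the first capture group (the link text) of every match
$titles = [regex]::Matches($sample, $linkPattern) |
    ForEach-Object { $_.Groups[1].Value }

$titles
```

To write the result to a file as in the question, `$titles` can be piped to `Out-File` with the desired path. The div text is skipped because the pattern only matches anchors whose `id` starts with `tid-link-`.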
Upvotes: 1