Enigma

Reputation: 123

Regex non capturing group

Regex experts: I need some help capturing the IP address and its status from the HTML string below.

$html = "<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Active Zone : BW Zone 1[1], &nbsp;&nbsp;VIP = 192.168.254.10</div>

<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.254.10/checkGlobalReplicationTier>https://192.168.254.10/checkGlobalReplicationTier</a>
&nbsp;&nbsp;[ACTIVE]</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.254.10/checkReplication>https://192.168.254.10/checkReplication</a></div>
<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.254.11/checkGlobalReplicationTier>https://192.168.254.11/checkGlobalReplicationTier</a>
&nbsp;&nbsp;[STANDBY]</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.254.11/checkReplication>https://192.168.254.11/checkReplication</a></div>
<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Local Zones:</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; LC Zone 3[3], &nbsp;&nbsp;VIP = 192.168.254.13
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.254.13/checkReplication>https://192.168.254.13/checkReplication</a>
&nbsp;&nbsp;[ACTIVE]</div>"
[regex]::matches($html, '((\d{1,3}\.){3}\d{1,3})((?s).*?)((?<=\[)[A-z]*(?=\]))').value

The regex above gets the IP and the status, but I want to omit everything in between them. How do I do this with a non-capturing regex? The desired output is:

192.168.254.10  Active
192.168.254.11  Standby
192.168.254.13  Active

Upvotes: 1

Views: 1658

Answers (2)

iRon

Reputation: 23862

It is generally a bad idea to attempt to parse HTML with regular expressions.
Instead, use a dedicated HTML parser such as the HtmlDocument class (and the Uri class for URIs).

Example

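function ParseHtml($String) {
    # Pass the markup to the HTMLFile COM object as UTF-16 (Unicode) bytes.
    $Unicode = [System.Text.Encoding]::Unicode.GetBytes($String)
    $Html = New-Object -Com 'HTMLFile'
    # The write method is exposed under different names depending on the system.
    if ($Html.PSObject.Methods.Name -Contains 'IHTMLDocument2_Write') {
        $Html.IHTMLDocument2_Write($Unicode)
    }
    else {
        $Html.write($Unicode)
    }
    $Html.Close()
    # Return the parsed document.
    $Html
}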

$Document = ParseHtml $Html
$Document.getElementsByTagName('div') |ForEach-Object {
    if ($_.lastChild.nodeValue -match '\[(?<Status>ACTIVE|STANDBY)\]') {
        [pscustomobject]@{
            Ip     = ([Uri]$($_.getElementsByTagName('a')).href).Host
            Status = $Matches.Status
        }
    }
}

Ip             Status
--             ------
192.168.254.10 ACTIVE
192.168.254.11 STANDBY
192.168.254.13 ACTIVE

Upvotes: 3

mklement0

Reputation: 440536

Generally, consider iRon's helpful answer for robust HTML parsing with a dedicated parser.

How do I do this with a non-capturing regex?

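You can't, because a regex match is always one contiguous span of the input. Look-around assertions (such as the positive look-behind and look-ahead assertions in your attempt, (?<=\[) and (?=\])) can keep text at the edges of that span out of the match, but they cannot cut text out of its middle: everything between the IP address and the status must still be consumed in order to get from one to the other.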
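A minimal sketch of the problem, assuming a made-up sample string ($sample is hypothetical and not the HTML from the question): the look-arounds keep the brackets out of the match, yet the text between the IP address and the status is still part of the matched value.

```powershell
# Hypothetical sample input; not the HTML from the question.
$sample = '10.0.0.1 some filler text [ACTIVE]'

# The brackets are only asserted by the look-arounds, not consumed...
$m = [regex]::Match($sample, '(?:\d{1,3}\.){3}\d{1,3}.+?(?<=\[)[A-Z]+(?=\])')

# ...but the filler in the middle is still in the match, because .+? had to consume it.
$m.Value   # 10.0.0.1 some filler text [ACTIVE
```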

Instead, use two capture groups and access them as follows:

[regex]::Matches(
  $html,
  '(?s)((?:\d{1,3}\.){3}\d{1,3}).+?\[([A-Z]+)\]'
) | ForEach-Object {
  [pscustomobject] @{
    Ip     = $_.Groups[1].Value
    Status = $_.Groups[2].Value
  }
}

This results in the following display output:

Ip             Status
--             ------
192.168.254.10 ACTIVE
192.168.254.11 STANDBY
192.168.254.13 ACTIVE
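The desired output in the question shows the status in title case (Active, Standby). If that matters, one possible variation is sketched below; it assumes a shortened, made-up input ($sample) in place of the question's $html, and the use of TextInfo.ToTitleCase() is a suggestion rather than part of the original solution.

```powershell
# Hypothetical shortened input standing in for the question's $html.
$sample = @'
<a href=https://192.168.254.10/check>link</a>
&nbsp;&nbsp;[ACTIVE]
<a href=https://192.168.254.11/check>link</a>
&nbsp;&nbsp;[STANDBY]
'@

# Culture-invariant casing rules, so the result does not depend on the session's culture.
$textInfo = [cultureinfo]::InvariantCulture.TextInfo

$result = [regex]::Matches($sample, '(?s)((?:\d{1,3}\.){3}\d{1,3}).+?\[([A-Z]+)\]') |
  ForEach-Object {
    [pscustomobject] @{
      Ip     = $_.Groups[1].Value
      # ToTitleCase() leaves all-uppercase words unchanged, so lowercase first.
      Status = $textInfo.ToTitleCase($_.Groups[2].Value.ToLowerInvariant())
    }
  }

$result
```

Lowercasing first is needed because ToTitleCase() treats all-uppercase words as acronyms and does not change them.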

Upvotes: 0
