Enigma

Reputation: 123

Parsing <div> HTML content with  

I have the monitoring page output below, which I am trying to parse into a variable.

<html>
<head>
<style type="text/css"></style>
</head>
<body>
<div style="float:left;margin-right:50px">
<div>DATA CENTERS WITH GLOBAL REPLICATION TIER ENABLED/SUSPENDED:


<div><br><br>&nbsp;&nbsp;&nbsp;&nbsp; DataCenter: DC1 NY [ENABLED]
<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Active Zone : BW Zone 1[1], &nbsp;&nbsp;VIP = 192.168.254.10</div>

<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.254.10/checkGlobalReplicationTier>https://192.168.254.10/checkGlobalReplicationTier</a>
&nbsp;&nbsp;[ACTIVE]</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.254.10/checkReplication>https://192.168.254.10/checkReplication</a></div>
<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.254.11/checkGlobalReplicationTier>https://192.168.254.11/checkGlobalReplicationTier</a>
&nbsp;&nbsp;[STANDBY]</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.254.11/checkReplication>https://192.168.254.11/checkReplication</a></div>
<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Local Zones:</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; LC Zone 3[3], &nbsp;&nbsp;VIP = 192.168.254.13
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.254.13/checkReplication>https://192.168.254.13/checkReplication</a>
&nbsp;&nbsp;[ACTIVE]</div>


<div><br><br>&nbsp;&nbsp;&nbsp;&nbsp; DataCenter: DC2 NJ [ENABLED]
&nbsp;[DEFAULT DC]</div>
<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Active Portal Zone : BW Zone 2[2], &nbsp;&nbsp;VIP = 192.168.253.10</div>

<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.253.10/checkGlobalReplicationTier>https://192.168.253.10/checkGlobalReplicationTier</a>
&nbsp;&nbsp;[ACTIVE]</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.253.10/checkReplication>https://192.168.253.10/checkReplication</a></div>
<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.253.11/checkGlobalReplicationTier>https://192.168.253.11/checkGlobalReplicationTier</a>
&nbsp;&nbsp;[STANDBY]</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.253.11/checkReplication>https://192.168.253.11/checkReplication</a></div>
<div><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Local Zones:</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; LC Zone 4[4], &nbsp;&nbsp;VIP = 192.168.253.13
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.253.13/checkReplication>https://192.168.253.13/checkReplication</a>
&nbsp;&nbsp;[ACTIVE]</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href=https://192.168.253.14/checkReplication>https://192.168.253.14/checkReplication</a>
&nbsp;&nbsp;[STANDBY]</div>



--> </div>
</div>
</body>
</html>

I would like to parse this to get:

Data Center                    Active Zone    VIP             Local Zone    VIP
DC1 NY [Enabled]               BW Zone 1[1]   192.168.254.10  LC Zone 3[3]  192.168.254.13
DC2 NJ [Enabled] [DEFAULT DC]  BW Zone 2[2]   192.168.253.10  LC Zone 4[4]  192.168.253.13

My code below does not seem to be able to parse it. Is regex the best way to parse this page, or should I try some other way?

$zone = "https://192.168.0.90/checkConfiguration"
$html = Invoke-WebRequest -Uri $zone -ErrorAction Stop
$DC = ($html.ParsedHtml.getElementsByTagName('div') |  Where-Object { $_.InnerHTML -like '<div><br><br>&nbsp;&nbsp;&nbsp;&nbsp; DataCenter: *' })  |  Foreach-Object {$_.outerText -replace '(?<!:.*):', '='} | %{new-object psobject -prop (ConvertFrom-StringData $_)}

Upvotes: 0

Views: 495

Answers (1)

Theo

Reputation: 61028

For that you could do this:

$div = $html.ParsedHtml.getElementsByTagName('div') | Where-Object { $_.InnerHTML -like '<div>*DataCenter:*' }
$DC = if ($div -and $div.outerText -match '(?s)DataCenter\s*:\s*(\w+).*Active Zone\s*:\s*([^,]+),\s+VIP\s*=\s*([\d\.]+)') {
    [PsCustomObject]@{
        'DataCenter'  = $matches[1]
        'Active Zone' = $matches[2]
        'VIP'         = $matches[3]
    }
}

$DC | Format-Table -AutoSize

Output:

DataCenter Active Zone VIP         
---------- ----------- ---         
DC1        BW Zone     192.168.0.95

Or as a list:

$DC | Format-List

Output:

DataCenter  : DC1
Active Zone : BW Zone
VIP         : 192.168.0.95

Here's a different approach for when the HTML contains multiple datacenters:

# use outerText to get the plain text for the surrounding <div>DATA CENTERS WITH GLOBAL REPLICATION TIER ENABLED/SUSPENDED ...</div>
$content = ($html.ParsedHtml.getElementsByTagName('div') | Where-Object { $_.innerHtml -like '<div>DATA CENTERS*' }).outerText
$DC = $content -split 'DataCenter\s*:\s*' |
      Where-Object { $_ -match '(?s)([\w ]+(?:[ [\w\]]*)).*Active (?:Portal )?Zone\s*:\s*([^,]+),\s+VIP\s*=\s*([\d.]+)' } | 
      ForEach-Object { 
        [PsCustomObject]@{
            'DataCenter'  = $matches[1]
            'Active Zone' = $matches[2]
            'VIP'         = $matches[3]
        }
      }

$DC | Format-Table -AutoSize 

Output:

DataCenter                     Active Zone  VIP           
----------                     -----------  ---           
DC1 NY [ENABLED]               BW Zone 1[1] 192.168.254.10
DC2 NJ [ENABLED]  [DEFAULT DC] BW Zone 2[2] 192.168.253.10
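
Note that `ParsedHtml` is only populated in Windows PowerShell 5.1; it is not available in PowerShell 6 and later. As a rough fallback, the plain text could be rebuilt from the raw HTML string instead. The sketch below is an assumption-laden illustration, not part of the code above: `ConvertTo-PlainText` is a hypothetical helper name, `$sample` stands in for the page content, and the tag stripping is deliberately crude.

```powershell
# Hypothetical helper (an illustration only): reduce an HTML string to plain text.
function ConvertTo-PlainText {
    param([Parameter(Mandatory)][string]$Html)

    $text = $Html -replace '(?s)<style.*?</style>', ''   # drop any <style> block with its content
    $text = $text -replace '<br\s*/?>|</div>', "`n"      # turn line-breaking tags into newlines
    $text = $text -replace '<[^>]+>', ''                 # strip all remaining tags
    [System.Net.WebUtility]::HtmlDecode($text)           # decode entities such as &nbsp;
}

# Sample line copied from the question's HTML; a live response would use $html.Content instead.
$sample  = '<div><br>&nbsp;&nbsp; Active Zone : BW Zone 1[1], &nbsp;&nbsp;VIP = 192.168.254.10</div>'
$content = ConvertTo-PlainText -Html $sample

if ($content -match 'Active (?:Portal )?Zone\s*:\s*([^,]+),\s+VIP\s*=\s*([\d.]+)') {
    $zone = $matches[1]
    $vip  = $matches[2]
    "$zone -> $vip"
}
```

The decoded `&nbsp;` characters become non-breaking spaces, which `\s` in .NET regular expressions also matches, so the same patterns should keep working on this text.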

Regex details for the pattern in the multiple-datacenter example:

(?s)                  Match the remainder of the regex with the options: dot matches newline (s)
(                     Match the regular expression below and capture its match into backreference number 1
   [\w ]              Match a single character present in the list below
                      A word character (letters, digits, etc.)
                      The character “ ”
      +               Between one and unlimited times, as many times as possible, giving back as needed (greedy)
   (?:                Match the regular expression below
      [ [\w\]]        Match a single character present in the list below
                      One of the characters “ [”
                      A word character (letters, digits, etc.)
                      A ] character
         *            Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   )                 
)                    
.                     Match any single character
   *                  Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
Active\               Match the characters “Active ” literally
(?:                   Match the regular expression below
   Portal\            Match the characters “Portal ” literally
)?                    Between zero and one times, as many times as possible, giving back as needed (greedy)
Zone                  Match the characters “Zone” literally
\s                    Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
   *                  Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
:                     Match the character “:” literally
\s                    Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
   *                  Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
(                     Match the regular expression below and capture its match into backreference number 2
   [^,]               Match any character that is NOT a “,”
      +               Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)                    
,                     Match the character “,” literally
\s                    Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
   +                  Between one and unlimited times, as many times as possible, giving back as needed (greedy)
VIP                   Match the characters “VIP” literally
\s                    Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
   *                  Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
=                     Match the character “=” literally
\s                    Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
   *                  Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
(                     Match the regular expression below and capture its match into backreference number 3
   [\d.]              Match a single character present in the list below
                      A single digit 0..9
                      The character “.”
      +               Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
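
The table in the question also lists the first local zone and its VIP. A minimal sketch of how the same split-and-match idea might be extended to capture those is below. It assumes the `outerText` looks roughly like the `$sample` text (the real output may differ in whitespace), and it uses the property names `Active VIP` and `Local VIP` because an object cannot have two properties both called `VIP`.

```powershell
# Sample text standing in for the outerText of the surrounding <div> (assumed shape).
$sample = @'
DATA CENTERS WITH GLOBAL REPLICATION TIER ENABLED/SUSPENDED:

     DataCenter: DC1 NY [ENABLED]
         Active Zone : BW Zone 1[1],   VIP = 192.168.254.10
             https://192.168.254.10/checkGlobalReplicationTier   [ACTIVE]
         Local Zones:
             LC Zone 3[3],   VIP = 192.168.254.13

     DataCenter: DC2 NJ [ENABLED]  [DEFAULT DC]
         Active Portal Zone : BW Zone 2[2],   VIP = 192.168.253.10
         Local Zones:
             LC Zone 4[4],   VIP = 192.168.253.13
'@

# Group 1: rest of the "DataCenter:" line; groups 2-3: active zone and VIP;
# groups 4-5: first local zone and VIP after "Local Zones:".
$pattern = '(?s)^([^\r\n]+).*?Active (?:Portal )?Zone\s*:\s*([^,]+),\s+VIP\s*=\s*([\d.]+)' +
           '.*?Local Zones:\s*([^,]+),\s+VIP\s*=\s*([\d.]+)'

$DC = $sample -split 'DataCenter\s*:\s*' | ForEach-Object {
    if ($_ -match $pattern) {
        [PsCustomObject]@{
            'Data Center' = $matches[1].Trim()
            'Active Zone' = $matches[2].Trim()
            'Active VIP'  = $matches[3]
            'Local Zone'  = $matches[4].Trim()
            'Local VIP'   = $matches[5]
        }
    }
}

$DC | Format-Table -AutoSize
```

Only the first local zone of each datacenter is captured here; a datacenter with several local zones would need a second pass over the text after `Local Zones:`.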

Upvotes: 1
