Ian W

Reputation: 4767

PowerShell regex: multiple matches per line

I'm having a little trouble constructing a PowerShell -replace regex that's not too greedy.

Looking to convert occurrences of this pattern: /sites/*/*/SitePages/*/*.aspx to: /sites/*/*/SitePages/*/*.html

But I'm having an issue when there are multiple values on one line to be replaced: the greediness of -replace captures the whole line up to the last match, so only the last occurrence is replaced.

sample input:

<div class="ms-wikicontent ms-rtestate-field" style="padding-right: 10px"><div class="ExternalClass8E56354CC4314DBA861E187B689F3A2B"><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div class="ms-rte-layoutszone-outer" style="width:100%"><div class="ms-rte-layoutszone-inner" role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" class="ms-wikilink" href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a id="1::Jenkins|Jenkins" class="ms-wikilink" href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a><h1 class="ms-rteElement-H1">Jenkins Integration with Deployment Tools</h1>

failing regex segment:

% { $_ -Replace '(sites.*SitePages.*)\.aspx' , '${1}.html' }

Suggestions?

(motivation: I am trying to convert the aspx page references to html as we've moved from hosting on SharePoint. Pages are all static, so no issues, other than converting the page extensions)

Upvotes: 0

Views: 401

Answers (4)

iRon

Reputation: 23852

As you noted yourself, using a regular expression to peek and poke in a structured string can give unexpected, greedy results. As suggested before, it is generally a bad idea to parse HTML with regular expressions. Instead, use a dedicated HTML parser such as the HtmlDocument class (and the Uri class for URIs).

Example

$html = '<div class="ms-wikicontent ms-rtestate-field" style="padding-right: 10px"><div class="ExternalClass8E56354CC4314DBA861E187B689F3A2B"><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div class="ms-rte-layoutszone-outer" style="width:100%"><div class="ms-rte-layoutszone-inner" role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" class="ms-wikilink" href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a id="1::Jenkins|Jenkins" class="ms-wikilink" href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a><h1 class="ms-rteElement-H1">Jenkins Integration with Deployment Tools</h1>'

function ParseHtml($String) {
    $Unicode = [System.Text.Encoding]::Unicode.GetBytes($String)
    $Html = New-Object -Com 'HTMLFile'
    if ($Html.PSObject.Methods.Name -Contains 'IHTMLDocument2_Write') {
        $Html.IHTMLDocument2_Write($Unicode)
    } 
    else {
        $Html.write($Unicode)
    }
    $Html.Close()
    $Html
}

$Document = ParseHtml $Html
# You might also select your div from a presumably larger document:
# $div = $Document.getElementsByClassName('ms-wikicontent')
$Document.getElementsByTagName('a') |ForEach-Object {
    if (([Uri]$_.href).LocalPath -like '/sites/*/*/SitePages/*.aspx') {
        $_.href = [System.IO.Path]::ChangeExtension($_.href, 'html')
    }
}
$Document.body.innerHtml

result:

<DIV class="ms-wikicontent ms-rtestate-field" style="PADDING-RIGHT: 10px">
<DIV class=ExternalClass8E56354CC4314DBA861E187B689F3A2B>
<TABLE id=layoutsTable style="WIDTH: 100%">
<TBODY>
<TR style="VERTICAL-ALIGN: top">
<TD style="WIDTH: 100%">
<DIV class=ms-rte-layoutszone-outer style="WIDTH: 100%">
<DIV aria-haspopup=true role=textbox aria-multiline=true class=ms-rte-layoutszone-inner aria-autocomplete=both><A id=0::Home|Home class=ms-wikilink href="/sites/Team/Project/SitePages/Home.html">Home</A> - <A id=1::Jenkins|Jenkins class=ms-wikilink href="/sites/Team/Project/SitePages/Jenkins.html">Jenkins</A>
<H1 class=ms-rteElement-H1>Jenkins Integration with Deployment Tools</H1></DIV></DIV></TR></TBODY></DIV></DIV>
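The two .NET helpers used inside the loop can also be tried on their own. This is a minimal sketch, assuming an absolute URL on the placeholder host example.com (a hypothetical value, not from the question), since [Uri] only exposes LocalPath for absolute URIs:

```powershell
# Hypothetical sample value; example.com is a placeholder host, not part of the question.
$href = 'https://example.com/sites/Team/Project/SitePages/Home.aspx'

# [Uri] splits the URL so that only the path part is tested.
$uri = [Uri]$href
$localPath = $uri.LocalPath          # /sites/Team/Project/SitePages/Home.aspx

# Same wildcard test as in the loop above.
$isSitePage = $localPath -like '/sites/*/*/SitePages/*.aspx'   # True

# ChangeExtension swaps only the final extension of the path string.
$newPath = [System.IO.Path]::ChangeExtension($localPath, 'html')
$newPath                             # /sites/Team/Project/SitePages/Home.html
```

Testing the wildcard against LocalPath rather than the raw href keeps the host name out of the comparison.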

Upvotes: 5

The fourth bird

Reputation: 163632

Without lookarounds, you can use a capture group like in your question. But when matching, you should not cross a ", because each URL sits between double quotes.

(/sites\b[^\"]*/SitePages/[^\"]+)\.aspx\b

Explanation

  • ( Capture group 1
    • /sites\b Match /sites and a word boundary
    • [^\"]*/SitePages/ Optionally match any char except " and then match /SitePages/
    • [^\"]+ Match 1+ chars other than "
  • ) Close group 1
  • \.aspx\b Match .aspx and a word boundary

See a regex demo.

# $input is an automatic variable in PowerShell, so use a different name
$text = @"
<div class="ms-wikicontent ms-rtestate-field" style="padding-right: 10px"><div class="ExternalClass8E56354CC4314DBA861E187B689F3A2B"><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div class="ms-rte-layoutszone-outer" style="width:100%"><div class="ms-rte-layoutszone-inner" role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" class="ms-wikilink" href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a id="1::Jenkins|Jenkins" class="ms-wikilink" href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a><h1 class="ms-rteElement-H1">Jenkins Integration with Deployment Tools</h1>
"@

$text -replace '(/sites\b[^\"]*/SitePages/[^\"]+)\.aspx\b', '$1.html'

Output

<div class="ms-wikicontent ms-rtestate-field" style="padding-right: 10px"><div class="ExternalClass8E56354CC4314DBA861E187B689F3A2B"><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div class="ms-rte-layoutszone-outer" style="width:100%"><div class="ms-rte-layoutszone-inner" role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" class="ms-wikilink" href="/sites/Team/Project/SitePages/Home.html">Home</a> - <a id="1::Jenkins|Jenkins" class="ms-wikilink" href="/sites/Team/Project/SitePages/Jenkins.html">Jenkins</a><h1 class="ms-rteElement-H1">Jenkins Integration with Deployment Tools</h1>

Another variation: if there are always exactly 2 path segments between /sites and /SitePages, you can match them with an exact repetition using the quantifier {2} and, for example, assert the double quote right after .aspx with a lookahead:

(/sites(?:/[^/\"]+){2}/SitePages/[^/\"]+)\.aspx(?=\")

See another regex demo.
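To see how the two patterns differ in strictness, here is a small sketch. The shortened sample string is hypothetical (it is not from the question); its second link has an extra path level below SitePages:

```powershell
# Hypothetical shortened sample: the second link is nested one level deeper.
$sample = '<a href="/sites/Team/Project/SitePages/Home.aspx">Home</a> ' +
          '<a href="/sites/Team/Project/SitePages/Sub/Page.aspx">Page</a>'

# First pattern: anything except a double quote may follow /SitePages/.
$loose  = $sample -replace '(/sites\b[^\"]*/SitePages/[^\"]+)\.aspx\b', '$1.html'

# Second pattern: exactly two segments before /SitePages/ and one file name after it.
$strict = $sample -replace '(/sites(?:/[^/\"]+){2}/SitePages/[^/\"]+)\.aspx(?=\")', '$1.html'

$loose    # both links end in .html
$strict   # only Home.aspx is converted; Sub/Page.aspx is left as-is
```

Which one fits depends on whether pages nested below SitePages should be converted as well.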

Upvotes: 2

zett42

Reputation: 27806

Daniel already showed an excellent solution using character exclusion [^/]:

$_ -replace '(?<=/sites/[^/]*/[^/]*/SitePages/[^/]*)aspx', 'html'

Alternatively, you could use lazy quantifiers (.*? instead of .*):

$_ -replace '(?<=/sites/.*?/.*?/SitePages/.*?)aspx', 'html'

While the latter looks cleaner, it is less performant, because it requires more backtracking.

I did a little benchmark:

$text = '<div class="ms-wikicontent ms-rtestate-field" style="padding-right: 10px"><div class="ExternalClass8E56354CC4314DBA861E187B689F3A2B"><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div class="ms-rte-layoutszone-outer" style="width:100%"><div class="ms-rte-layoutszone-inner" role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" class="ms-wikilink" href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a id="1::Jenkins|Jenkins" class="ms-wikilink" href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a><h1 class="ms-rteElement-H1">Jenkins Integration with Deployment Tools</h1>'

$runs = 100000
$excludeMillis = (Measure-Command { foreach( $i in 1..$runs ) { $text -replace '(?<=/sites/[^/]*/[^/]*/SitePages/[^/]*)aspx', 'html' }}).TotalMilliseconds
$lazyMillis    = (Measure-Command { foreach( $i in 1..$runs ) { $text -replace '(?<=/sites/.*?/.*?/SitePages/.*?)aspx', 'html' }}).TotalMilliseconds

[PSCustomObject]@{
    RegExExclude = '{0} ms'        -f [int]$excludeMillis
    RegExLazy    = '{0} ms ({1}%)' -f [int]$lazyMillis, [int]($lazyMillis / $excludeMillis * 100)
}

Output from PS 7.2:

RegExExclude RegExLazy    
------------ ---------
281 ms       350 ms (125%)

The difference is noticeable, but not that big, so you may go for readability if performance doesn't matter.


The performance difference between the two becomes even smaller when using a compiled RegEx:

$text = '<div class="ms-wikicontent ms-rtestate-field" style="padding-right: 10px"><div class="ExternalClass8E56354CC4314DBA861E187B689F3A2B"><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div class="ms-rte-layoutszone-outer" style="width:100%"><div class="ms-rte-layoutszone-inner" role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" class="ms-wikilink" href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a id="1::Jenkins|Jenkins" class="ms-wikilink" href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a><h1 class="ms-rteElement-H1">Jenkins Integration with Deployment Tools</h1>'

$runs = 100000

$rxExclude = [regex]::new( '(?<=/sites/[^/]*/[^/]*/SitePages/[^/]*)aspx', [Text.RegularExpressions.RegexOptions]::Compiled )
$rxLazy    = [regex]::new( '(?<=/sites/.*?/.*?/SitePages/.*?)aspx', [Text.RegularExpressions.RegexOptions]::Compiled )

$excludeMillis = (Measure-Command { foreach( $i in 1..$runs ) { $rxExclude.Replace( $text, 'html' ) }}).TotalMilliseconds
$lazyMillis    = (Measure-Command { foreach( $i in 1..$runs ) { $rxLazy.Replace( $text, 'html' ) }}).TotalMilliseconds

[PSCustomObject]@{
    RegExExclude = '{0} ms'        -f [int]$excludeMillis
    RegExLazy    = '{0} ms ({1}%)' -f [int]$lazyMillis, [int]($lazyMillis / $excludeMillis * 100)
}

Output from PS 7.2:

RegExExclude RegExLazy
------------ ---------
160 ms       178 ms (111%)
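Apart from speed, the two patterns are not strictly equivalent, because .*? may also run across / characters. A small sketch with a hypothetical sample (the trailing text is made up for illustration) shows where they diverge:

```powershell
# Hypothetical sample: one real link, followed by the bare word "aspx" in plain text.
$sample = '<a href="/sites/Team/Project/SitePages/Home.aspx">Home</a> (was an aspx page)'

# Exclusion: [^/]* cannot cross the "/" in "</a>", so only the link is changed.
$exclude = $sample -replace '(?<=/sites/[^/]*/[^/]*/SitePages/[^/]*)aspx', 'html'

# Lazy: .*? can still stretch over "</a>", so the later "aspx" is changed too.
$lazy    = $sample -replace '(?<=/sites/.*?/.*?/SitePages/.*?)aspx', 'html'

$exclude   # ...Home.html">Home</a> (was an aspx page)
$lazy      # ...Home.html">Home</a> (was an html page)
```

For input like the sample in the question this makes no difference, but it may matter if the word "aspx" can appear elsewhere on the same line.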

Upvotes: 1

Harikrishnan

Reputation: 49

Try:

[string]$string = "<div class='ms-wikicontent ms-rtestate-field' style='padding-right: 10px'><div class='ExternalClass8E56354CC4314DBA861E187B689F3A2B'><table id='layoutsTable' style='width:100%'><tbody><tr style='vertical-align:top'><td style='width:100%'><div class='ms-rte-layoutszone-outer' style='width:100%'><div class='ms-rte-layoutszone-inner' role='textbox' aria-haspopup='true' aria-autocomplete='both' aria-multiline='true'><a id='0::Home|Home' class='ms-wikilink' href='/sites/Team/Project/SitePages/Home.aspx'>Home</a> - <a id='1::Jenkins|Jenkins' class='ms-wikilink' href='/sites/Team/Project/SitePages/Jenkins.aspx'>Jenkins</a><h1 class='ms-rteElement-H1'>Jenkins Integration with Deployment Tools</h1>"

$string.Replace('.aspx','.html')

Or, if you are looking to build a regex, check out https://rubular.com/; it helps with building regular expressions.
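Note that a plain string replacement is not scoped to the /sites/.../SitePages/ links and is case-sensitive. A minimal sketch, assuming a hypothetical sample with one unrelated external link (example.com is a placeholder) and reusing the capture-group pattern from another answer in this thread for comparison:

```powershell
# Hypothetical sample: one SharePoint page link and one unrelated external link.
$sample = '<a href="/sites/Team/Project/SitePages/Home.aspx">Home</a> ' +
          '<a href="https://example.com/Login.aspx">Login</a>'

# String.Replace changes every ".aspx", including the external link.
$plain  = $sample.Replace('.aspx', '.html')

# The scoped regex only touches links under /sites/.../SitePages/.
$scoped = $sample -replace '(/sites\b[^\"]*/SitePages/[^\"]+)\.aspx\b', '$1.html'

# String.Replace is case-sensitive, while -replace ignores case by default.
$upperPlain = 'Home.ASPX'.Replace('.aspx', '.html')      # Home.ASPX (unchanged)
$upperRegex = 'Home.ASPX' -replace '\.aspx', '.html'     # Home.html
```

If every .aspx reference in the files should become .html, the plain replacement may be all that is needed.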

Upvotes: -1
