Reputation: 4767
Having a little trouble constructing a PowerShell -replace regex that's not too greedy.
Looking to convert occurrences of this pattern: /sites/*/*/SitePages/*/*.aspx
to: /sites/*/*/SitePages/*/*.html
But I'm having an issue where there are multiple values on one line to be replaced. -replace's greediness captures the whole line and replaces only the last occurrence.
sample input:
<div class="ms-wikicontent ms-rtestate-field" style="padding-right: 10px"><div class="ExternalClass8E56354CC4314DBA861E187B689F3A2B"><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div class="ms-rte-layoutszone-outer" style="width:100%"><div class="ms-rte-layoutszone-inner" role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" class="ms-wikilink" href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a id="1::Jenkins|Jenkins" class="ms-wikilink" href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a><h1 class="ms-rteElement-H1">Jenkins Integration with Deployment Tools</h1>
failing regex segment:
% { $_ -Replace '(sites.*SitePages.*)\.aspx' , '${1}.html' }
Suggestions?
(motivation: I am trying to convert the aspx page references to html as we've moved from hosting on SharePoint. Pages are all static, so no issues, other than converting the page extensions)
Upvotes: 0
Views: 401
Reputation: 23852
Just as you stated yourself, using a regular expression to peek and poke in a structured string might give unexpected and greedy results. As suggested before, it is generally a bad idea to attempt to parse HTML with regular expressions. Instead, use a dedicated HTML parser such as the HtmlDocument class (and the Uri class for URIs).
Example
$html = '<div class="ms-wikicontent ms-rtestate-field" style="padding-right: 10px"><div class="ExternalClass8E56354CC4314DBA861E187B689F3A2B"><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div class="ms-rte-layoutszone-outer" style="width:100%"><div class="ms-rte-layoutszone-inner" role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" class="ms-wikilink" href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a id="1::Jenkins|Jenkins" class="ms-wikilink" href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a><h1 class="ms-rteElement-H1">Jenkins Integration with Deployment Tools</h1>'
function ParseHtml($String) {
    $Unicode = [System.Text.Encoding]::Unicode.GetBytes($String)
    $Html = New-Object -Com 'HTMLFile'
    if ($Html.PSObject.Methods.Name -Contains 'IHTMLDocument2_Write') {
        $Html.IHTMLDocument2_Write($Unicode)
    }
    else {
        $Html.write($Unicode)
    }
    $Html.Close()
    $Html
}
$Document = ParseHtml $Html
# You might also select your div from a presumably larger document:
# $div = $Document.getElementsByClassName('ms-wikicontent')
$Document.getElementsByTagName('a') | ForEach-Object {
    if (([Uri]$_.href).LocalPath -like '/sites/*/*/SitePages/*.aspx') {
        $_.href = [System.IO.Path]::ChangeExtension($_.href, 'html')
    }
}
$Document.body.innerHtml
result:
<DIV class="ms-wikicontent ms-rtestate-field" style="PADDING-RIGHT: 10px">
<DIV class=ExternalClass8E56354CC4314DBA861E187B689F3A2B>
<TABLE id=layoutsTable style="WIDTH: 100%">
<TBODY>
<TR style="VERTICAL-ALIGN: top">
<TD style="WIDTH: 100%">
<DIV class=ms-rte-layoutszone-outer style="WIDTH: 100%">
<DIV aria-haspopup=true role=textbox aria-multiline=true class=ms-rte-layoutszone-inner aria-autocomplete=both><A id=0::Home|Home class=ms-wikilink href="/sites/Team/Project/SitePages/Home.html">Home</A> - <A id=1::Jenkins|Jenkins class=ms-wikilink href="/sites/Team/Project/SitePages/Jenkins.html">Jenkins</A>
<H1 class=ms-rteElement-H1>Jenkins Integration with Deployment Tools</H1></DIV></DIV></TR></TBODY></DIV></DIV>
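To make the two helper calls in the loop easier to follow, here is a minimal sketch of what [Uri].LocalPath and [System.IO.Path]::ChangeExtension do on their own. The example.com URL and the variable names are made up for illustration. Note that * in -like also matches /, so the wildcard pattern does not enforce an exact folder depth.

```powershell
# Hypothetical absolute URL, used only for illustration.
$uri = [Uri]'https://example.com/sites/Team/Project/SitePages/Home.aspx?view=1'

# LocalPath drops the scheme, host and query string.
$path = $uri.LocalPath        # /sites/Team/Project/SitePages/Home.aspx

# -like uses wildcards, not regex; * may also span '/' characters.
$isPage   = $path -like '/sites/*/*/SitePages/*.aspx'                              # True
$isDeeper = '/sites/a/b/c/SitePages/x/y.aspx' -like '/sites/*/*/SitePages/*.aspx'  # True as well

# ChangeExtension only swaps the text after the last dot of the file name.
$newPath = [System.IO.Path]::ChangeExtension($path, 'html')   # /sites/Team/Project/SitePages/Home.html

$path, $isPage, $isDeeper, $newPath
```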
Upvotes: 5
Reputation: 163632
Without lookarounds, you can use a capture group like in your question. But when matching, you should not cross the " character, since the string sits between double quotes.
(/sites\b[^\"]*/SitePages/[^\"]+)\.aspx\b
Explanation
(  Capture group 1
/sites\b  Match /sites and a word boundary
[^\"]*/SitePages/  Optionally match any char except " and then match /SitePages/
[^\"]+  Match 1+ chars other than "
)  Close group 1
\.aspx\b  Match .aspx and a word boundary
See a regex demo.
# $input is an automatic variable in PowerShell, so a different name is used here.
$text = @"
<div class="ms-wikicontent ms-rtestate-field" style="padding-right: 10px"><div class="ExternalClass8E56354CC4314DBA861E187B689F3A2B"><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div class="ms-rte-layoutszone-outer" style="width:100%"><div class="ms-rte-layoutszone-inner" role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" class="ms-wikilink" href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a id="1::Jenkins|Jenkins" class="ms-wikilink" href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a><h1 class="ms-rteElement-H1">Jenkins Integration with Deployment Tools</h1>
"@
$text -replace '(/sites\b[^\"]*/SitePages/[^\"]+)\.aspx\b', '$1.html'
Output
<div class="ms-wikicontent ms-rtestate-field" style="padding-right: 10px"><div class="ExternalClass8E56354CC4314DBA861E187B689F3A2B"><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div class="ms-rte-layoutszone-outer" style="width:100%"><div class="ms-rte-layoutszone-inner" role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" class="ms-wikilink" href="/sites/Team/Project/SitePages/Home.html">Home</a> - <a id="1::Jenkins|Jenkins" class="ms-wikilink" href="/sites/Team/Project/SitePages/Jenkins.html">Jenkins</a><h1 class="ms-rteElement-H1">Jenkins Integration with Deployment Tools</h1>
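For comparison, here is a small sketch that runs the pattern from the question and the bounded pattern side by side. The shortened $line value and the variable names are invented for illustration; only the two regexes come from this thread.

```powershell
# Shortened, hypothetical sample with two links on one line.
$line = '<a href="/sites/T/P/SitePages/Home.aspx">Home</a> <a href="/sites/T/P/SitePages/Jenkins.aspx">Jenkins</a>'

# Greedy .* runs to the last ".aspx" on the line, so the whole span is one match
# and only the final extension is rewritten.
$greedy = $line -replace '(sites.*SitePages.*)\.aspx', '${1}.html'

# [^"]* cannot leave the quoted attribute value, so each href is matched on its own.
$bounded = $line -replace '(/sites\b[^"]*/SitePages/[^"]+)\.aspx\b', '$1.html'

$greedy    # Home.aspx is left alone, Jenkins.aspx becomes Jenkins.html
$bounded   # both links end in .html
```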
Another variation: if there are always 2 path segments (each starting with /) between /sites and /SitePages, you can do an exact repetition with the quantifier {2} and, for example, assert the double quote after .aspx:
(/sites(?:/[^/\"]+){2}/SitePages/[^/\"]+)\.aspx(?=\")
See another regex demo.
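A quick sketch of how that stricter variation behaves in PowerShell. The three sample strings and the variable names are hypothetical, chosen to show one match and two deliberate non-matches.

```powershell
$pattern = '(/sites(?:/[^/\"]+){2}/SitePages/[^/\"]+)\.aspx(?=\")'

$samples = @(
    'href="/sites/Team/Project/SitePages/Home.aspx"'      # exactly two segments: rewritten
    'href="/sites/Team/SitePages/Home.aspx"'              # only one segment: left unchanged
    'href="/sites/Team/Project/SitePages/Sub/Home.aspx"'  # page in a subfolder: left unchanged
)

# The lookahead (?=\") checks for the closing quote without consuming it,
# so the replacement does not need to put the quote back.
$results = $samples | ForEach-Object { $_ -replace $pattern, '$1.html' }
$results
```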
Upvotes: 2
Reputation: 27806
Daniel already showed an excellent solution using character exclusion [^/]:
$_ -replace '(?<=/sites/[^/]*/[^/]*/SitePages/[^/]*)aspx', 'html'
Alternatively, you could use the lazy modifier ? instead:
$_ -replace '(?<=/sites/.*?/.*?/SitePages/.*?)aspx', 'html'
While the latter looks cleaner, it is less performant, because it requires more backtracking.
I did a little benchmark:
$text = '<div class="ms-wikicontent ms-rtestate-field" style="padding-right: 10px"><div class="ExternalClass8E56354CC4314DBA861E187B689F3A2B"><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div class="ms-rte-layoutszone-outer" style="width:100%"><div class="ms-rte-layoutszone-inner" role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" class="ms-wikilink" href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a id="1::Jenkins|Jenkins" class="ms-wikilink" href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a><h1 class="ms-rteElement-H1">Jenkins Integration with Deployment Tools</h1>'
$runs = 100000
$excludeMillis = (Measure-Command { foreach( $i in 1..$runs ) { $text -replace '(?<=/sites/[^/]*/[^/]*/SitePages/[^/]*)aspx', 'html' }}).TotalMilliseconds
$lazyMillis = (Measure-Command { foreach( $i in 1..$runs ) { $text -replace '(?<=/sites/.*?/.*?/SitePages/.*?)aspx', 'html' }}).TotalMilliseconds
[PSCustomObject]@{
    RegExExclude = '{0} ms' -f [int]$excludeMillis
    RegExLazy = '{0} ms ({1}%)' -f [int]$lazyMillis, [int]($lazyMillis / $excludeMillis * 100)
}
Output from PS 7.2:
RegExExclude RegExLazy
------------ ---------
281 ms 350 ms (125%)
The difference is noticeable, but not that big, so you may go for readability if performance doesn't matter.
The performance difference between the two becomes even smaller when using a compiled RegEx:
$text = '<div class="ms-wikicontent ms-rtestate-field" style="padding-right: 10px"><div class="ExternalClass8E56354CC4314DBA861E187B689F3A2B"><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div class="ms-rte-layoutszone-outer" style="width:100%"><div class="ms-rte-layoutszone-inner" role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" class="ms-wikilink" href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a id="1::Jenkins|Jenkins" class="ms-wikilink" href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a><h1 class="ms-rteElement-H1">Jenkins Integration with Deployment Tools</h1>'
$runs = 100000
$rxExclude = [regex]::new( '(?<=/sites/[^/]*/[^/]*/SitePages/[^/]*)aspx', [Text.RegularExpressions.RegexOptions]::Compiled )
$rxLazy = [regex]::new( '(?<=/sites/.*?/.*?/SitePages/.*?)aspx', [Text.RegularExpressions.RegexOptions]::Compiled )
$excludeMillis = (Measure-Command { foreach( $i in 1..$runs ) { $rxExclude.Replace( $text, 'html' ) }}).TotalMilliseconds
$lazyMillis = (Measure-Command { foreach( $i in 1..$runs ) { $rxLazy.Replace( $text, 'html' ) }}).TotalMilliseconds
[PSCustomObject]@{
    RegExExclude = '{0} ms' -f [int]$excludeMillis
    RegExLazy = '{0} ms ({1}%)' -f [int]$lazyMillis, [int]($lazyMillis / $excludeMillis * 100)
}
Output from PS 7.2:
RegExExclude RegExLazy
------------ ---------
160 ms 178 ms (111%)
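Besides speed, the two patterns can also differ in what they match, because .*? is allowed to run across / and across attribute boundaries while [^/]* is not. A minimal sketch, using a made-up line with one SitePages link and one unrelated .aspx link:

```powershell
# Hypothetical line: the second link is not under /sites/*/*/SitePages/.
$line = '<a href="/sites/Team/Project/SitePages/Home.aspx">Home</a> <a href="/other/Page.aspx">Other</a>'

# [^/]* keeps every part of the lookbehind inside a single path segment.
$exclude = $line -replace '(?<=/sites/[^/]*/[^/]*/SitePages/[^/]*)aspx', 'html'

# .*? may span the rest of the line, so any later "aspx" also satisfies the lookbehind.
$lazy = $line -replace '(?<=/sites/.*?/.*?/SitePages/.*?)aspx', 'html'

$exclude   # only Home.aspx is changed
$lazy      # Home.aspx and /other/Page.aspx are both changed
```

On the sample input from the question both patterns give the same result, so this only matters if a line can contain other .aspx references after a SitePages link.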
Upvotes: 1
Reputation: 49
Try:
[string]$string = "<div class='ms-wikicontent ms-rtestate-field' style='padding-right: 10px'><div class='ExternalClass8E56354CC4314DBA861E187B689F3A2B'><table id='layoutsTable' style='width:100%'><tbody><tr style='vertical-align:top'><td style='width:100%'><div class='ms-rte-layoutszone-outer' style='width:100%'><div class='ms-rte-layoutszone-inner' role='textbox' aria-haspopup='true' aria-autocomplete='both' aria-multiline='true'><a id='0::Home|Home' class='ms-wikilink' href='/sites/Team/Project/SitePages/Home.aspx'>Home</a> - <a id='1::Jenkins|Jenkins' class='ms-wikilink' href='/sites/Team/Project/SitePages/Jenkins.aspx'>Jenkins</a><h1 class='ms-rteElement-H1'>Jenkins Integration with Deployment Tools</h1>"
$string.Replace('.aspx','.html')
Or, if you are looking to build a regex, check out https://rubular.com/; it helps to build regular expressions.
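A short sketch of what the plain String.Replace call does and does not do. The sample line is invented for illustration: the method is a literal, case-sensitive substitution, so it rewrites every .aspx regardless of the path and skips differently cased extensions.

```powershell
# Hypothetical line with a SitePages link, an upper-case extension and an unrelated page.
$line = '<a href="/sites/T/P/SitePages/Home.aspx">Home</a> <a href="/other/Login.ASPX">Login</a> <a href="/other/Page.aspx">Page</a>'

# No regex involved: every exact occurrence of ".aspx" is replaced.
$literal = $line.Replace('.aspx', '.html')

$literal   # Home.html and Page.html are rewritten, Login.ASPX is untouched
```

If all pages are being converted anyway, as in the question's scenario, that broad behavior may be exactly what is wanted.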
Upvotes: -1