Reputation: 407
Our CMS is (I suppose correctly) encoding comma characters in URLs, so instead of "?values=1,2,3" it renders "?values=1%2c2%2c3". This in itself is not a problem; however, the external system that these links point at cannot handle the encoded commas and only works if we pass actual commas in the query string.
We already have a Regex clean-up tool that processes the HTML pre-render and cleans out non-XHTML-compliant mark-up. This is an old CMS running on ASP.NET 2.0.
My question is: what regular expression would swap out all occurrences of "%2c" for a comma, but only where this text exists within an anchor tag? I can easily swap out all instances of %2c, but this runs the risk of corrupting the page elsewhere if that string happens to be used for a non-URL purpose.
I'm using .Net and System.Text.RegularExpressions. We have an XML file that contains all of the Find and Replace rules. This gets loaded at runtime and cleans the HTML. Each rule consists of:
Find: "<script>"
Replace: "<script type='text/javascript'>"
We then have some C# that loops over each of the rules and does the following:
// HTML = full page HTML
Regex regex = new Regex(searchTxt, RegexOptions.IgnoreCase);
HTML = regex.Replace(HTML, replaceTxt);
Simple as that. I just can't get the right regex syntax for our specific scenario.
Many thanks for your help.
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string html = GetPageHTML();
        string regexString = "(<a href=).*|(%2c)";
        string replaceTxt = ",";
        RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Multiline;
        Regex regex = new Regex(regexString, options);

        // We are currently using a simple regex.Replace
        string cleanHTML = regex.Replace(html, replaceTxt);

        // But for this example should we be doing something with the Matches collection?
        foreach (Match match in regex.Matches(html))
        {
            if (match.Success)
            {
                // do something?
            }
        }
    }

    private static string GetPageHTML()
    {
        return @"<html>
<head></head>
<body>
<a title='' href='http://www.testsite.com/?x=491191%2cy=291740%2czoom=6%2cbase=demo%2clayers=%2csearch=text:WE9%203QA%2cfade=false%2cmX=0%2cmY=0' target='_blank'>A link</a>
<p>We wouldn't want this (%2c) to be replaced</p>
</body>
</html>";
    }
}
Upvotes: 0
Views: 686
Reputation: 31045
If .NET supported PCRE backtracking verbs, you could do something like this:
^(?!<a href=").*(*SKIP)(*FAIL)|(%2c)
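That is what you want: the regex above matches %2c only inside anchor tags, assuming multiline mode and that each anchor starts its line with <a href=". .NET does not support (*SKIP)(*FAIL), but you can achieve the same result by using the regex discard technique plus some logic.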
If you use the regex below, each match is either a whole line that does not start with an anchor tag (the capturing group stays empty) or a %2c within an anchor tag, which is captured in group 1:
^(?!<a href=").*|(%2c)
So what you can do is add logic that checks whether the capturing group contains %2c. If it does, the match comes from an anchor tag and you can replace it with a comma; if it does not, leave the matched text unchanged.
Upvotes: 2