Reputation: 1606
I'm trying to use a regex to replace any whitespace with "&nbsp;" inside the example HTML:
<span someattr="a">and some words with spaces</span>
It's a desktop app, and this HTML is coming to/from a third-party control. I don't have the luxury of working with any type of HTML parser, so I'm stuck with regex.
I can't seem to come up with a regex that would match only the whitespace inside any number of span tags.
Thanks
Upvotes: 0
Views: 2974
Reputation: 617
Replace all occurrences of the following with "&nbsp;":
(?<=<span\b[^>]*>(?:(?!</?span\b).)*(?(ReverseDepth)(?!))(?:(?:<span\b[^>]*>(?<-ReverseDepth>)|</span>(?<ReverseDepth>))(?:(?!</?span\b).)*)*)\u0020(?![^<]*>)
This should work for any depth of span elements no matter what other elements are present. Note that this will only work with .NET regular expressions, because it relies on variable-length lookbehind and balancing groups.
This regex is very finicky. Be careful if you try to change anything.
Thanks to moonshadow for pointing out the fancy open-close matching syntax in .net regexes.
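For readers unfamiliar with the open-close matching syntax, here is a minimal sketch of .NET balancing groups on their own. It is not the regex above (which runs inside a lookbehind and is therefore matched right to left); it is a simpler, forward-matching illustration that only checks whether span tags are balanced. The class name `BalancedSpans` and the group name `open` are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BalancedSpans
{
    // Hypothetical illustration: succeeds only if every <span> has a matching </span>.
    private static readonly Regex Balanced = new Regex(
        @"^(?:
            (?<open><span\b[^>]*>)   # opening tag: push a capture onto 'open'
          | (?<-open></span>)        # closing tag: pop one capture; fails if none is left
          | (?!</?span\b).           # any other character that does not start a span tag
          )*
          (?(open)(?!))              # fail if an opening tag is still unmatched
          $",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);

    public static bool IsBalanced(string html)
    {
        return Balanced.IsMatch(html);
    }

    public static void Main()
    {
        Console.WriteLine(IsBalanced("<span>a <span>b</span> c</span>")); // True
        Console.WriteLine(IsBalanced("<span>a</span></span>"));           // False
    }
}
```

The named group acts as a stack: `(?<open>...)` pushes, `(?<-open>...)` pops, and the conditional `(?(open)(?!))` forces a failure when anything is left on the stack.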
Upvotes: 1
Reputation: 96507
How about this? Note that the code block is eating up the "&nbsp;", so I separated the ampersand from the rest of the text to make it visible. The line inside the regex replace actually reads:
m.Groups["text"].Value.Replace(" ", "&nbsp;")
Here's the sample:
string html = @"<span someattr=""a"">and some words with spaces</span>";
string pattern = @"<(?<tag>\w*)(?<attributes>[^>]+)?>(?<text>.*)</\k<tag>>";
string result = Regex.Replace(html, pattern,
m => String.Format("<{0}{1}>{2}</{0}>",
m.Groups["tag"].Value,
m.Groups["attributes"].Value,
m.Groups["text"].Value.Replace(" ", "& nbsp;")
)
);
Result = <span someattr="a">and&nbsp;some&nbsp;words&nbsp;with&nbsp;spaces</span>
Things will get complicated quickly if you have nested span tags, however.
EDIT: reconstructed tag and attributes, added string format to tidy things up
Upvotes: 0
Reputation: 25543
This appears to work, but I'd definitely do some serious unit testing (and code cleanup) first. This is based on section 3.17 of the Regular Expressions Cookbook combined with a library snippet from RegexBuddy. (NOTE: Will not work with nested span tags.)
using System;
using System.Text.RegularExpressions;

public class MyClass
{
    // Outer regex: the text between an opening <span ...> and the next </span>
    private static Regex outerRegex = new Regex("(?<=<span[^>]*>).*?(?=</span>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);
    // Inner regex: a single whitespace character
    private static Regex innerRegex = new Regex(@"\s");

    public static void Main()
    {
        string subjectString = "my dog has <span someattr=\"a\">" +
            "and some words with spaces</span> fleas" +
            "<frog>space z</frog> <span> </span>";
        string resultString = outerRegex.Replace(subjectString,
            new MatchEvaluator(ComputeReplacement));
        Console.WriteLine(resultString);
    }

    public static string ComputeReplacement(Match matchResult)
    {
        // Run the inner search-and-replace on each match of the outer regex
        // (the string was not getting escaped so I broke it up)
        return innerRegex.Replace(matchResult.Value, "&" + "nbsp;");
    }
}
Upvotes: 0
Reputation: 22240
This could potentially be very slow with very large strings.
But this works:
(?<=\<span[^>]*>[^<]+)\s(?=[^<]+\</span>)
With a replacement string of:
&nbsp;
The reason I say it might be slow is that it's having to find the whitespace (\s) and then search towards the left and to the right to see if it's surrounded by a span tag. And it'll have to do the same thing for every character of whitespace individually. But I believe this should work reliably as long as your HTML is well-formed and you aren't dealing with nested span tags.
And by the way, since this is for .NET you can use Regex Hero to build the code for you:
// Verbatim string (@"...") so the backslashes reach the regex engine unchanged
string strRegex = @"(?<=\<span[^>]*>[^<]+)\s(?=[^<]+\</span>)";
RegexOptions myRegexOptions = RegexOptions.None;
Regex myRegex = new Regex(strRegex, myRegexOptions);
string strTargetString = "<span someattr=\"a\">and some words with spaces</span>";
string strReplace = "&nbsp;";
return myRegex.Replace(strTargetString, strReplace);
Upvotes: 1
Reputation: 36522
Semi-related, in looking for a solution for this, I found a php-based perl regular expression article that may or may not be helpful for .net:
Upvotes: 0
Reputation: 89105
Regex on its own is a poor fit for nested data. Your best bet if you can't use a third-party parser is to bite the bullet and write some code - perhaps using a parser generator - to parse the nesting.
(That said, check the documentation for your regexp library; you may find it has extensions to aid parsing of nested data, e.g. .net's balancing groups construct)
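As a rough sketch of the "write some code" route, the following walks the string tag by tag, keeps a counter of open span elements, and replaces whitespace only in text that sits inside at least one span. It is a minimal illustration, not a real parser: it assumes no ">" inside attribute values, no self-closing span tags, and no comments or script blocks. The names `SpanWhitespace` and `ReplaceInsideSpans` are made up for this example.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class SpanWhitespace
{
    // Matches any tag; everything between two tag matches is treated as text.
    private static readonly Regex TagRegex = new Regex(@"<[^>]*>");

    public static string ReplaceInsideSpans(string html)
    {
        var output = new StringBuilder();
        int depth = 0;     // number of currently open <span> elements
        int position = 0;  // start of the text run that has not been copied yet

        foreach (Match tag in TagRegex.Matches(html))
        {
            AppendText(output, html.Substring(position, tag.Index - position), depth);
            output.Append(tag.Value);  // tags are copied through unchanged
            position = tag.Index + tag.Length;

            if (Regex.IsMatch(tag.Value, @"^<span\b", RegexOptions.IgnoreCase))
                depth++;
            else if (depth > 0 && Regex.IsMatch(tag.Value, @"^</span\b", RegexOptions.IgnoreCase))
                depth--;
        }

        AppendText(output, html.Substring(position), depth);
        return output.ToString();
    }

    // Replaces whitespace only when the text is inside at least one span.
    private static void AppendText(StringBuilder output, string text, int depth)
    {
        output.Append(depth > 0 ? Regex.Replace(text, @"\s", "&nbsp;") : text);
    }

    public static void Main()
    {
        Console.WriteLine(ReplaceInsideSpans(
            "a b <span x=\"1\">c d <b>e f</b> <span>g h</span> i</span> j k"));
        // a b <span x="1">c&nbsp;d&nbsp;<b>e&nbsp;f</b>&nbsp;<span>g&nbsp;h</span>&nbsp;i</span> j k
    }
}
```

Because the depth counter handles nesting, each regex only has to recognise a single tag or a single whitespace character, which keeps the patterns simple and leaves whitespace inside the tags themselves untouched.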
Upvotes: 1