Reputation: 449
I'm working on a small assignment that requires the use of regular expressions with HTML strings. My current problem is properly obtaining strings enclosed within HTML tags.
For instance:
I have a string
<p><Placeholder></p>
I've been able to obtain the contents with the following regex
    private string Unescape()
    {
        string s = WebUtility.HtmlDecode("<p><Placeholder></p>");
        string dec = Regex.Replace(s, "^<.*?>|^<.*?><.*?>", "");
        return Regex.Replace(dec, "</.*?>$|</.*?></.*?>$", "");
    }
Which would return:
<Placeholder>
However, should the string contain an additional HTML tag, e.g.:
<p><strong>Placeholder</strong></p>
I would get this
<strong>Placeholder
It appears I'm only able to successfully remove the closing tag(s), but I can't do the same with the opening tag(s). Could anybody tell me where I've gone wrong?
EDIT:
To summarize: is there a way for me to treat the string enclosed within the HTML tags as a literal, so that content containing special characters (e.g. > and <) is left intact?
Upvotes: 0
Views: 926
Reputation: 92976
I am not sure you will be happy using regex on HTML, but I want to explain what causes your "mis"match:
The regex engine tries the alternatives of an alternation from left to right and stops at the first one that leads to a successful match; it does not go on to look for a longer one. So when you search at the start for
^<.*?>|^<.*?><.*?>
on the string
<p><strong>Placeholder</strong></p>
the first alternative succeeds, so the match ends there and only <p> is removed. If you want to match <p><strong> at the start, put the longer alternative first. This is only needed for the pattern at the start of the string; for the pattern at the end of the string your ordering is fine.
So for your example this would work:
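    private string Unescape()
    {
        // Uses the second example string. With "<p><Placeholder></p>" this
        // ordering would also strip <Placeholder>, because it looks like a tag.
        string s = WebUtility.HtmlDecode("<p><strong>Placeholder</strong></p>");
        string dec = Regex.Replace(s, "^<.*?><.*?>|^<.*?>", "");
        return Regex.Replace(dec, "</.*?>$|</.*?></.*?>$", "");
    }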
==> The ordering inside an alternation can be important
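To see the ordering effect in isolation, here is a minimal sketch that prints what each ordering matches at the start of the second example string. The class name AlternationOrderDemo and the helper LeadingMatch are made up for illustration; only Regex.Match comes from .NET.

```csharp
using System;
using System.Text.RegularExpressions;

static class AlternationOrderDemo
{
    // Hypothetical helper: returns the text matched by the pattern, or "" if there is no match.
    public static string LeadingMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    static void Main()
    {
        string input = "<p><strong>Placeholder</strong></p>";

        // Shorter alternative first: it succeeds, so the longer one is never tried.
        Console.WriteLine(LeadingMatch(input, "^<.*?>|^<.*?><.*?>"));   // <p>

        // Longer alternative first: both opening tags are matched.
        Console.WriteLine(LeadingMatch(input, "^<.*?><.*?>|^<.*?>"));   // <p><strong>
    }
}
```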
An alternative would be to use a quantifier instead of an alternation:
    string dec = Regex.Replace(s, "^(?:<.*?>)+", "");
    return Regex.Replace(dec, "(?:</.*?>)+$", "");
This also works for more than two tags.
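As a quick check of the quantifier version, here is a small sketch; the names TagStripper and StripOuterTags are hypothetical and not from the question. Note that the pattern cannot tell a real tag from text that merely looks like one, so on the first example string it removes <Placeholder> as well and leaves nothing.

```csharp
using System;
using System.Text.RegularExpressions;

static class TagStripper
{
    // Hypothetical helper: removes a run of tags at the start
    // and a run of closing tags at the end of the string.
    public static string StripOuterTags(string s)
    {
        string withoutOpening = Regex.Replace(s, "^(?:<.*?>)+", "");
        return Regex.Replace(withoutOpening, "(?:</.*?>)+$", "");
    }

    static void Main()
    {
        // Two nested tags, as in the question.
        Console.WriteLine(StripOuterTags("<p><strong>Placeholder</strong></p>"));     // Placeholder

        // Three nested tags are handled by the same pattern.
        Console.WriteLine(StripOuterTags("<div><p><em>Placeholder</em></p></div>"));  // Placeholder

        // Tag-like content is stripped too, so the result is empty.
        Console.WriteLine(StripOuterTags("<p><Placeholder></p>").Length);             // 0
    }
}
```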
Upvotes: 1