Reputation: 67
I need to replace some text in C# using RegEx:
string strSText = "<P>Bulleted list</P><UL><P><LI>Bullet 1</LI><P></P><P><LI>Bullet 2</LI><P></P><P><LI>Bullet 3</LI><P></UL>";
Basically, I need to get rid of the "<P>" tag(s) introduced in the sequences "<UL><P><LI>", "</LI><P></P><P><LI>" and "</LI><P></UL>".
I also need to ignore any spaces between these tags when performing the removal.
So "</LI><P></P><P><LI>", "</LI> <P></P><P><LI>", "</LI><P></P><P> <LI>" or "</LI> <P> </P> <P> <LI>" must all be replaced with "</LI><LI>".
I tried using the following RegEx match for this purpose:
strSText = Regex.Replace(strSText, "<UL>.*<LI>", "<UL><LI>", RegexOptions.IgnoreCase);
strSText = Regex.Replace(strSText, "</LI>.*<LI>", "</LI><LI>", RegexOptions.IgnoreCase);
strSText = Regex.Replace(strSText, "</LI>.*</UL>", "</LI></UL>", RegexOptions.IgnoreCase);
But it performs a "greedy" match and results in:
"<P>Bulleted list</P><UL><LI>Bullet 3</LI></UL>"
I then tried using a "lazy" match:
strSText = Regex.Replace(strSText, "<UL>.*?<LI>", "<UL><LI>", RegexOptions.IgnoreCase);
strSText = Regex.Replace(strSText, "</LI>.*?<LI>", "</LI><LI>", RegexOptions.IgnoreCase);
strSText = Regex.Replace(strSText, "</LI>.*?</UL>", "</LI></UL>", RegexOptions.IgnoreCase);
and this results in:
"<P>Bulleted list</P><UL><LI>Bullet 1</LI></UL>"
But I want the following result, which preserves all other data:
"<P>Bulleted list</P><UL><LI>Bullet 1</LI><LI>Bullet 2</LI><LI>Bullet 3</LI></UL>"
Upvotes: 1
Views: 561
Reputation: 56
Not really an answer to your question, but more of a comment to Jonathon: parse the HTML with HtmlAgilityPack.
Upvotes: 1
Reputation: 3991
The following regexp matches one or more <P> or </P> tags, each optionally followed by whitespace:
(?:</?P>\s*)+
So if you place that between the other tags you have, you can get rid of them, i.e.
strSText = Regex.Replace(strSText, @"<UL>\s*(?:</?P>\s*)+<LI>", "<UL><LI>", RegexOptions.IgnoreCase);
strSText = Regex.Replace(strSText, @"</LI>\s*(?:</?P>\s*)+<LI>", "</LI><LI>", RegexOptions.IgnoreCase);
strSText = Regex.Replace(strSText, @"</LI>\s*(?:</?P>\s*)+</UL>", "</LI></UL>", RegexOptions.IgnoreCase);
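For anyone who wants to try the three replacements above end to end, here is a minimal sketch of them as a small console program. The names ListCleaner, StripStrayParagraphs and PTags are made up for illustration; only Regex.Replace and RegexOptions.IgnoreCase come from the answer. It assumes the stray tags are always plain <P> or </P> without attributes.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, named here only for illustration.
public static class ListCleaner
{
    // Optional whitespace, then one or more <P> or </P> tags,
    // each optionally followed by whitespace.
    private const string PTags = @"\s*(?:</?P>\s*)+";

    public static string StripStrayParagraphs(string html)
    {
        // Stray tags between <UL> and the first <LI>.
        html = Regex.Replace(html, "<UL>" + PTags + "<LI>", "<UL><LI>", RegexOptions.IgnoreCase);
        // Stray tags between one list item and the next.
        html = Regex.Replace(html, "</LI>" + PTags + "<LI>", "</LI><LI>", RegexOptions.IgnoreCase);
        // Stray tags between the last list item and </UL>.
        html = Regex.Replace(html, "</LI>" + PTags + "</UL>", "</LI></UL>", RegexOptions.IgnoreCase);
        return html;
    }
}

public static class Program
{
    public static void Main()
    {
        string strSText = "<P>Bulleted list</P><UL><P><LI>Bullet 1</LI><P></P><P>" +
                          "<LI>Bullet 2</LI><P></P><P><LI>Bullet 3</LI><P></UL>";

        Console.WriteLine(ListCleaner.StripStrayParagraphs(strSText));
    }
}
```

Because the pattern only allows <P> tags and whitespace between the anchors, it cannot run across a bullet's text the way .* and .*? do, so the leading <P>Bulleted list</P> and the bullet contents are left alone. One side effect to be aware of: with IgnoreCase, lower-case tags such as <ul><p><li> also match and are rewritten in the upper-case form given in the replacement string.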
Upvotes: 1