Reputation: 101
From a set of more than 10,000 rows of text, I need to find every string in which the space after one of a set of HTML tags is missing. The set of HTML tags is limited to the following:
<b> </b>, <em> </em>, <span style="text-decoration: underline;" data-mce-style="text-decoration: underline;"> </span>
<sub> </sub>, <sup> </sup>, <ul> </ul>, <li> </li>, <ol> </ol>
After running the regex, the following string should appear in the results:
Hi <b>all</b>good morning.
In this case, the space after the closing bold tag is missing.
Upvotes: 2
Views: 627
Reputation: 336258
Assuming C#:
// Requires: using System.Collections.Specialized; using System.Text.RegularExpressions;
// subjectString holds the text to search.
StringCollection resultList = new StringCollection();
Regex regexObj = new Regex("^.*<(?:/?b|/?em|/?su[pb]|/?[ou]l|/?li|span style=\"text-decoration: underline;\" data-mce-style=\"text-decoration: underline;\"|/span)>(?! ).*$", RegexOptions.Multiline);
Match matchResult = regexObj.Match(subjectString);
while (matchResult.Success) {
    // Each match is one whole line containing a tag with no space after it
    resultList.Add(matchResult.Value);
    matchResult = matchResult.NextMatch();
}
will return all lines in your file where at least one of the tags in your list is not followed by a space. (A tag at the very end of a line also counts, since no space follows it.)
Input:
This </b> is <b> OK
This <b> is </b>not OK
Neither <b>is </b> this.
Output:
This <b> is </b>not OK
Neither <b>is </b> this.
Explanation:
^        # Start of line
.*       # Match any number of characters except newlines
<        # Match a <
(?:      # Either match a...
 /?b     # b or /b
|        # or
 /?em    # em or /em
|...     # etc. etc.
)        # End of alternation
>        # Match a >
(?! )    # Assert that no space follows
.*       # Match any number of characters until...
$        # End of line
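For what it's worth, here is the same idea as a small self-contained program, run against the sample input above. Treat it as a sketch rather than a drop-in solution: the class name `MissingSpaceFinder` and the method `FindLines` are made up for illustration, the pattern is the one from the answer written as a verbatim string, and it assumes the text uses `\n` line endings.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class; the name is an assumption, not part of any library.
public static class MissingSpaceFinder
{
    // Same pattern as in the answer, as a verbatim string (inner quotes are doubled).
    private static readonly Regex TagWithoutSpace = new Regex(
        @"^.*<(?:/?b|/?em|/?su[pb]|/?[ou]l|/?li|span style=""text-decoration: underline;"" data-mce-style=""text-decoration: underline;""|/span)>(?! ).*$",
        RegexOptions.Multiline);

    // Returns every line that contains at least one listed tag
    // that is not immediately followed by a space.
    public static List<string> FindLines(string text)
    {
        var result = new List<string>();
        foreach (Match m in TagWithoutSpace.Matches(text))
        {
            result.Add(m.Value);
        }
        return result;
    }

    public static void Main()
    {
        string input =
            "This </b> is <b> OK\n" +
            "This <b> is </b>not OK\n" +
            "Neither <b>is </b> this.";

        foreach (string line in FindLines(input))
        {
            Console.WriteLine(line);
        }
        // Prints:
        // This <b> is </b>not OK
        // Neither <b>is </b> this.
    }
}
```

If the file uses `\r\n` line endings, each returned line will probably carry a trailing `\r`, because `.` in .NET only excludes `\n`; trimming the results is one way to handle that.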
Upvotes: 3