Dom

Reputation: 101

Regex to find missing space after html tags

From a set of more than 10,000 rows of text, I need to find every string in which the space after one of a set of HTML tags is missing. The set of HTML tags is limited; they are as follows:

<b> </b>, <em> </em>, <span style="text-decoration: underline;" data-mce-style="text-decoration: underline;"> </span>, <sub> </sub>, <sup> </sup>, <ul> </ul>, <li> </li>, <ol> </ol>

After running the regex, the following string should appear in the results:

Hi <b>all</b>good morning.

In this case the space after the closing bold tag is missing.

Upvotes: 2

Views: 627

Answers (1)

Tim Pietzcker

Reputation: 336258

Assuming C#:

using System.Collections.Specialized;
using System.Text.RegularExpressions;

// subjectString is assumed to hold the full text, one row per line.
StringCollection resultList = new StringCollection();
Regex regexObj = new Regex("^.*<(?:/?b|/?em|/?su[pb]|/?[ou]l|/?li|span style=\"text-decoration: underline;\" data-mce-style=\"text-decoration: underline;\"|/span)>(?! ).*$", RegexOptions.Multiline);
Match matchResult = regexObj.Match(subjectString);
while (matchResult.Success) {
    resultList.Add(matchResult.Value);      // the whole offending line
    matchResult = matchResult.NextMatch();  // continue after this match
}

This will return all lines in your file where at least one of the tags in your list is not followed by a space. A tag at the very end of a line also counts, since no space follows it.

Input:

This </b> is <b> OK
This <b> is </b>not OK
Neither <b>is </b> this.

Output:

This <b> is </b>not OK
Neither <b>is </b> this.

Explanation:

^      # Start of line
.*     # Match any number of characters except newlines
<      # Match a <
(?:    # Either match a...
 /?b   #  b or /b
|      # or 
 /?em  #  em or /em
|...   # etc. etc.
)      # End of alternation
>      # Match a >
(?! )  # Assert that no space follows
.*     # Match any number of characters until...
$      # End of line
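
To try the pattern against the sample input, a minimal self-contained sketch could look like the following. The class and method names (TagSpacingCheck, FindLines) are hypothetical and chosen only for illustration; the pattern is the same one as above, written as a verbatim string so that quotes are escaped by doubling them.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original answer.
public static class TagSpacingCheck
{
    // Same pattern as above; in a verbatim string, "" stands for one quote.
    private static readonly Regex MissingSpace = new Regex(
        @"^.*<(?:/?b|/?em|/?su[pb]|/?[ou]l|/?li|span style=""text-decoration: underline;"" data-mce-style=""text-decoration: underline;""|/span)>(?! ).*$",
        RegexOptions.Multiline);

    // Returns every line that contains a listed tag not followed by a space.
    public static List<string> FindLines(string text)
    {
        var lines = new List<string>();
        foreach (Match m in MissingSpace.Matches(text))
        {
            lines.Add(m.Value);
        }
        return lines;
    }

    public static void Main()
    {
        // The three sample lines from the Input section, joined with \n.
        string input = "This </b> is <b> OK\n"
                     + "This <b> is </b>not OK\n"
                     + "Neither <b>is </b> this.";

        foreach (string line in FindLines(input))
        {
            Console.WriteLine(line);
        }
    }
}
```

Run against the sample input, this should print the same two lines listed under Output. Using Regex.Matches instead of the Match/NextMatch loop is only a stylistic choice; both walk through the same matches.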

Upvotes: 3
