Saharcasm

Reputation: 148

Capturing HTML comments using Regex but ignoring a certain comment

I want to capture HTML comments, with the exception of one specific comment:

 <!-- end-readmore-item --> 

At the moment, I can successfully capture all of the HTML comments using the regex below,

(?=<!--)([\s\S]*?)-->

To ignore the specified comment, I have tried lookahead and lookbehind assertions, but being new to advanced regex, I am probably missing something.

So far, I have been able to devise the following regex using lookarounds:

^((?!<!-- end-readmore-item -->).)*$

I expect it to ignore the end-readmore-item comment and capture only other comments, such as:

<!-- Testing-->

It does skip that comment, but it also captures the regular HTML tags, which I want ignored as well.

I have been using the following HTML code as a test case:

<div class="collapsible-item-body" data-defaulttext="Further text">Further 
text</div>
<!-- end-readmore-item --></div>
</div>
&nbsp;<!-- -->

It should match only <!-- -->, but it selects everything except <!-- end-readmore-item -->.

The goal is to remove all HTML comments except <!-- end-readmore-item -->.

Upvotes: 3

Views: 81

Answers (2)

41686d6564

Reputation: 19641

You can use the following pattern:

<!--(?!\s*?end-readmore-item\s*-->)[\s\S]*?-->

Breakdown:

<!--                    # Matches `<!--` literally.
(?!                     # Start of a negative Lookahead (not followed by).
    \s*                 # Matches zero or more whitespace characters.
    end-readmore-item   # Matches literal string.
    \s*                 # Matches zero or more whitespace characters.
    -->                 # Matches `-->` literally.
)                       # End of the negative Lookahead.
[\s\S]*?                # Matches any character zero or more times (lazy match),
                        # including whitespace and non-whitespace characters.
-->                     # Matches `-->` literally.

Which basically means:

Match <!-- that is not followed by [whitespace* + end-readmore-item + whitespace* + -->], then any number of characters (as few as possible), then -->.

* Optional whitespace, repeated zero or more times.
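
For the stated goal of removing every comment except the marker, the pattern can be passed to a substitution call. The following is a minimal sketch in Python, which is an assumption since the question does not name a language; the names COMMENT_EXCEPT_MARKER and strip_comments are hypothetical, and the input is adapted from the question's test case with one extra comment appended.

```python
import re

# Pattern from the answer above: any comment except the end-readmore-item marker.
COMMENT_EXCEPT_MARKER = re.compile(
    r"<!--(?!\s*?end-readmore-item\s*-->)[\s\S]*?-->"
)

# Test input adapted from the question, plus one extra comment.
html = (
    '<div class="collapsible-item-body" data-defaulttext="Further text">Further\n'
    "text</div>\n"
    "<!-- end-readmore-item --></div>\n"
    "</div>\n"
    "&nbsp;<!-- -->\n"
    "<!-- Testing-->\n"
)


def strip_comments(text: str) -> str:
    """Remove every HTML comment except <!-- end-readmore-item -->."""
    return COMMENT_EXCEPT_MARKER.sub("", text)


# The pattern has no capturing groups, so findall returns whole matches.
matches = COMMENT_EXCEPT_MARKER.findall(html)
cleaned = strip_comments(html)

print(matches)  # ['<!-- -->', '<!-- Testing-->']
print(cleaned)
```

The marker comment survives because the negative lookahead rejects it right after the opening <!--, so the engine never consumes it.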

Upvotes: 2

Dovi Salomon

Reputation: 159

You are very close with your negative lookahead assertion; you just need to modify it as follows:

<!--((?!end-readmore-item).)*?-->

Here, *? matches non-greedily.

This will match all comments except those that contain the string end-readmore-item inside the comment body.
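
A short Python sketch (the language and the sample input are assumptions, not part of the question) may help show two properties of this pattern: it skips any comment that merely contains end-readmore-item, and because . does not match line breaks by default, multi-line comments are matched only when a dot-all flag such as re.DOTALL is set.

```python
import re

PATTERN = r"<!--((?!end-readmore-item).)*?-->"

# Hypothetical sample covering the marker, a comment that only mentions it,
# a multi-line comment, and an ordinary comment.
sample = (
    "<!-- end-readmore-item -->\n"
    "<!-- see end-readmore-item below -->\n"
    "<!-- multi\nline -->\n"
    "<!-- Testing-->"
)

# finditer + group(0) yields whole matches; findall would instead return
# only the last character captured by the repeated group.
found = [m.group(0) for m in re.finditer(PATTERN, sample, re.DOTALL)]
found_no_dotall = [m.group(0) for m in re.finditer(PATTERN, sample)]

print(found)            # ['<!-- multi\nline -->', '<!-- Testing-->']
print(found_no_dotall)  # ['<!-- Testing-->']
```

If only the exact marker should be preserved, a lookahead anchored on the closing --> may be the safer choice.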

Upvotes: 1
