labu77

Reputation: 807

Remove complete HTML tag when characters are found

A string contains an HTML tag wrapping a word plus a suffix (in this case ...rem).

Example:

<b>SomeText...rem</b>
<u>SomeText...rem</u>
<strong>SomeText...rem</strong>
<a href="/">SomeText...rem</a>
<div>SomeText...rem</div>

When the word inside the HTML tag contains

...rem

the complete HTML tag, including the word, should be removed.

I can rename "...rem". It's only a marker.

Is this possible?

Upvotes: 2

Views: 154

Answers (2)

user557597

Thought I'd take a shot.
Using PHP, here is an exact way to do it.

Updated version

This uses the \K construct, so there is no need to write the
Tracker data back to the string. Just replace with nothing.
It also gains speed doing it this way.
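
To make the \K point concrete, here is a minimal sketch on a deliberately simplified pattern (it is not the full regex from this answer, and the sample string is made up). It contrasts capturing the leading text and writing it back with letting \K drop that text from the reported match:

```php
<?php
// Simplified stand-in pattern, only meant to illustrate \K versus write-back.
$input = 'Hello <b>SomeText...rem</b>world';

// Write-back style: capture the leading text and re-insert it with $1.
$writeBack = preg_replace('~(Hello )<b>[^<]*\.\.\.rem</b>~', '$1', $input);

// \K style: the leading text is still matched, but \K resets the start of
// the reported match, so replacing with '' leaves that text in place.
$withK = preg_replace('~Hello \K<b>[^<]*\.\.\.rem</b>~', '', $input);

echo $writeBack, "\n"; // Hello world
echo $withK, "\n";     // Hello world
```

Both calls give the same string; the \K variant just avoids the capture group and the $1 in the replacement.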

Formatted and tested:

 # ** Usage **
 # -----------------
 # Find: '~(?s)(?:(?:(?&Comment)?(?!(?&RawContent)|(?&Comment)).)*\K(?(?=\z)|(?<OpenTag>(?><(?:(?<TagName>[\w:]+)(?:".*?"|\'.*?\'|[^>]*?)+)>)(?<!/>))(?<Body>(?&Char_Not_Tag)*?(?:(?&Tag_Not_TargetOpen)(?&Char_Not_Tag)*?)*?(?=.)(?&RawContent)(?&Char_Not_Tag)*?(?:(?&Tag_Not_TargetOpen)(?&Char_Not_Tag)*?)*?)(?<CloseTag>(?><(?:/\2\s*)>)))|.*?(?:(?&RawContent)|(?&Comment))\K)(?(DEFINE)(?<RawContent>\.\.\.rem)(?<Tag_Not_TargetOpen>(?><(?:(?!\2)[\w:]+(?:".*?"|\'.*?\'|[^>]*?)+)>|(?&Comment)))(?<Char_Not_Tag>(?!(?><(?:[\w:]+(?:".*?"|\'.*?\'|[^>]*?)+)>)|(?&Comment)).)(?<Comment>(?><(?:!(?:(?:DOCTYPE.*?)|(?:\[CDATA\[.*?\]\])|(?:--.*?--)|(?:ATTLIST.*?)|(?:ENTITY.*?)|(?:ELEMENT.*?)))>)))~'
 # Replace: nothing

 # Dot-all modifier
 (?s)

 # Single group, two alternatives.

 (?:
      # Alternative 1 (highest priority)
      # =================================

      # This is the backtracker. This is crucial!
      # We go all the way up until we find
      # the raw content we are looking for,
      # or comments (because they could hide tags).
      # Then we backtrack from there to 
      # find the closest inner open/close tags
      # that contain our content.

      # Tracker1 - Formerly captured and used as the replacement
      (?:
           (?&Comment)? 
           (?!
                (?&RawContent) 
             |  (?&Comment) 
           )
           . 
      )*

      # Avoid the need to write Tracker1 back
      \K 

      # Conditional Assertion -
      # Have we reached the end of string without 
      # finding the tagged Content ?

      (?(?= \z )
           # ---------------------------------------------
           # Yes -  Don't do anything, the remainder is in
           # Tracker1 and is thrown away.
           # ---------------------------------------------

        |  
           # ---------------------------------------------
           # No - Find the tagged Content.
           # If no match, Tracker1 will backtrack 1 char and retry.
           # Here, Tracker1 will find up to the point
           # of the tagged Content and be consumed, but thrown away.
           # ---------------------------------------------

           # Get Target Open tag
           (?<OpenTag>                         # (1)
                (?>
                     <
                     (?:
                          (?<TagName> [\w:]+ )                # (2), tag name
                          (?: " .*? " | ' .*? ' | [^>]*? )+
                     )
                     >
                )
                (?<! /> )
           )

           # Get Body containing the raw content   
           (?<Body>                            # (3)

                # Stuff before raw content
                (?&Char_Not_Tag)*? 
                (?:
                     (?&Tag_Not_TargetOpen) 
                     (?&Char_Not_Tag)*? 
                )*?

                # The raw content we need
                (?= . )
                (?&RawContent)                       

                # Stuff after raw content
                (?&Char_Not_Tag)*? 
                (?:
                     (?&Tag_Not_TargetOpen) 
                     (?&Char_Not_Tag)*? 
                )*?
           )

           # Get Target Close tag
           (?<CloseTag>                        # (4)
                (?>
                     <
                     (?: / \2 \s* )
                     >
                )
           )
      )
   |  
      # Alternative 2 (lowest priority)
      # =================================

      # Here, we've already backtracked all
      # possibilities from Tracker1.
      # At this point, we have raw content, 
      # or comments that we must get past.
      # Comments because they could hide tags.
      # Just take it off, it will be thrown away.

      # Tracker2 - Formerly captured and used as the replacement
      .*? 
      (?:
           (?&RawContent) 
        |  (?&Comment) 
      )

      # Avoid the need to write Tracker2 back
      \K 
 )



 # Functions
 # -----------------------
 (?(DEFINE)

      (?<RawContent>                      # (5)

           # Raw content we are looking for.
           # Note - this is content and is not contained
           # in tags nor comments.

           \.\.\.rem                           # '...rem' or whatever
      )

      (?<Tag_Not_TargetOpen>              # (6)

           # Consume any tag that
           # is not the target Open tag.
           # Consume comments as well.
           (?>
                <
                (?:
                     (?! \2 )
                     [\w:]+ 
                     (?: " .*? " | ' .*? ' | [^>]*? )+
                )
                >
             |  
                (?&Comment) 
           )
      )

      (?<Char_Not_Tag>                    # (7)

           # Consume any character
           # that does not begin a tag or comment
           (?!
                (?>
                     <
                     (?:
                          [\w:]+ 
                          (?: " .*? " | ' .*? ' | [^>]*? )+
                     )
                     >
                )
             |  
                (?&Comment) 
           )
           .  
      )

      (?<Comment>                         # (8)

           # Comment
           (?>
                <
                (?:
                     !
                     (?:
                          (?: DOCTYPE .*? )
                       |  (?: \[CDATA\[ .*? \]\] )
                       |  (?: -- .*? -- )
                       |  (?: ATTLIST .*? )
                       |  (?: ENTITY .*? )
                       |  (?: ELEMENT .*? )
                     )
                )
                >
           )
      )
 )
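
For anyone unfamiliar with the "Functions" part, here is a small sketch of how (?(DEFINE)...) and (?&Name) work together. The pattern and sample string below are hypothetical and much smaller than the regex above. They only show that groups inside DEFINE never match on their own and are called by name like subroutines:

```php
<?php
// Hypothetical mini-pattern: illustrates (?(DEFINE)...) and (?&Name) only.
$pattern = '~
    (?: (?&Comment) | (?&Marker) )    # call the subroutines defined below
    (?(DEFINE)
        (?<Comment> <!-- .*? --> )    # an HTML comment
        (?<Marker>  \.\.\.rem )       # the marker text
    )
~xs';

preg_match_all($pattern, 'a<!-- ...rem -->b...rem', $matches);

// $matches[0] holds the full matches: the comment first, then the bare marker.
// The marker inside the comment is consumed as part of the comment.
print_r($matches[0]);
```

Keeping the building blocks in one DEFINE section is what lets the main pattern reuse Comment, RawContent and so on without repeating them.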

Test case

Input:

<div>blah blah <i>some text</i> ...rem</div>
<b>SomeText...rem</b>
<u>SomeText...rem</b>
<strong>SomeText...rem</b>
<a href="/">SomeText...rem</a>
<div>SomeText...rem</div>

Output:

 **  Grp 0                      -  ( pos 0 , len 44 ) 
<div>blah blah <i>some text</i> ...rem</div>  
 **  Grp 1 [OpenTag]            -  ( pos 0 , len 5 ) 
<div>  
 **  Grp 2 [TagName]            -  ( pos 1 , len 3 ) 
div  
 **  Grp 3 [Body]               -  ( pos 5 , len 33 ) 
blah blah <i>some text</i> ...rem  
 **  Grp 4 [CloseTag]           -  ( pos 38 , len 6 ) 
</div>  

---------------------

 **  Grp 0                      -  ( pos 46 , len 21 ) 
<b>SomeText...rem</b>  
 **  Grp 1 [OpenTag]            -  ( pos 46 , len 3 ) 
<b>  
 **  Grp 2 [TagName]            -  ( pos 47 , len 1 ) 
b  
 **  Grp 3 [Body]               -  ( pos 49 , len 14 ) 
SomeText...rem  
 **  Grp 4 [CloseTag]           -  ( pos 63 , len 4 ) 
</b>  

---------------------

 **  Grp 0                      -  ( pos 86 , len 0 )  EMPTY 
 **  Grp 1 [OpenTag]            -  NULL 
 **  Grp 2 [TagName]            -  ( pos 70 , len 1 ) 
u  
 **  Grp 3 [Body]               -  NULL 
 **  Grp 4 [CloseTag]           -  NULL 

---------------------

 **  Grp 0                      -  ( pos 114 , len 0 )  EMPTY 
 **  Grp 1 [OpenTag]            -  NULL 
 **  Grp 2 [TagName]            -  ( pos 93 , len 6 ) 
strong  
 **  Grp 3 [Body]               -  NULL 
 **  Grp 4 [CloseTag]           -  NULL 

---------------------

 **  Grp 0                      -  ( pos 120 , len 30 ) 
<a href="/">SomeText...rem</a>  
 **  Grp 1 [OpenTag]            -  ( pos 120 , len 12 ) 
<a href="/">  
 **  Grp 2 [TagName]            -  ( pos 121 , len 1 ) 
a  
 **  Grp 3 [Body]               -  ( pos 132 , len 14 ) 
SomeText...rem  
 **  Grp 4 [CloseTag]           -  ( pos 146 , len 4 ) 
</a>  

---------------------

 **  Grp 0                      -  ( pos 152 , len 25 ) 
<div>SomeText...rem</div>  
 **  Grp 1 [OpenTag]            -  ( pos 152 , len 5 ) 
<div>  
 **  Grp 2 [TagName]            -  ( pos 153 , len 3 ) 
div  
 **  Grp 3 [Body]               -  ( pos 157 , len 14 ) 
SomeText...rem  
 **  Grp 4 [CloseTag]           -  ( pos 171 , len 6 ) 
</div>  

Previous version, with Tracker write-back:

 # ** Usage **
 # -----------------
 # Find: '~(?s)(?:(?<Tracker1>(?:(?&Comment)?(?!(?&RawContent)|(?&Comment)).)*)(?(?=\z)|(?<OpenTag>(?><(?:(?<TagName>[\w:]+)(?:".*?"|\'.*?\'|[^>]*?)+)>)(?<!/>))(?<Body>(?&Char_Not_Tag)*?(?:(?&Tag_Not_TargetOpen)(?&Char_Not_Tag)*?)*?(?=.)(?&RawContent)(?&Char_Not_Tag)*?(?:(?&Tag_Not_TargetOpen)(?&Char_Not_Tag)*?)*?)(?<CloseTag>(?><(?:/\3\s*)>)))|(?<Tracker2>.*?(?:(?&RawContent)|(?&Comment))))(?(DEFINE)(?<RawContent>\.\.\.rem)(?<Tag_Not_TargetOpen>(?><(?:(?!\3)[\w:]+(?:".*?"|\'.*?\'|[^>]*?)+)>|(?&Comment)))(?<Char_Not_Tag>(?!(?><(?:[\w:]+(?:".*?"|\'.*?\'|[^>]*?)+)>)|(?&Comment)).)(?<Comment>(?><(?:!(?:(?:DOCTYPE.*?)|(?:\[CDATA\[.*?\]\])|(?:--.*?--)|(?:ATTLIST.*?)|(?:ENTITY.*?)|(?:ELEMENT.*?)))>)))~'
 # Replace: '$1$6'

Upvotes: 0

Josh Crozier

Reputation: 240958

I would strongly suggest using an HTML parser for this. However, since your question asks for a regular expression, you could use the following and replace the matches in a callback.

/(?s)<(\w+)[^>]*>(.*?)<\/\1>/

Explanation:

  • (?s) - s flag so that the . character also matches newline characters.
  • <(\w+)[^>]*> - Match an opening HTML tag and capture the element name
  • (.*?) - Second capturing group to match the contents of the HTML tag
  • <\/\1> - Match the closing HTML tag by using a back reference based on the first capturing group (which is the tag name).
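
As a quick illustration of the breakdown above, this sketch runs the same pattern with preg_match on one of the sample tags from the question and prints what each group holds:

```php
<?php
$pattern = '/(?s)<(\w+)[^>]*>(.*?)<\/\1>/';

preg_match($pattern, '<a href="/">SomeText...rem</a>', $m);

echo $m[0], "\n"; // whole element: <a href="/">SomeText...rem</a>
echo $m[1], "\n"; // tag name:      a
echo $m[2], "\n"; // contents:      SomeText...rem
```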

Then use the function preg_replace_callback to replace the match with an empty string if the second capturing group contains the substring ...rem. Otherwise, do nothing by replacing the match with itself.

// $string holds the input HTML; the cleaned text ends up in $result.
$result = preg_replace_callback('/(?s)<(\w+)[^>]*>(.*?)<\/\1>/', function ($m) {
  return strpos($m[2], '...rem') !== false ? '' : $m[0];
}, $string);
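
Here is a rough, self-contained sketch of the callback approach applied to input like the question's. The helper name removeMarkedElements and the "Keep me" line are made up for illustration; the pattern and callback logic are the ones shown above:

```php
<?php
// removeMarkedElements() is a hypothetical helper name, not part of any library.
function removeMarkedElements(string $html, string $marker = '...rem'): string
{
    return preg_replace_callback(
        '/(?s)<(\w+)[^>]*>(.*?)<\/\1>/',
        function (array $m) use ($marker): string {
            // Drop the element if its contents contain the marker,
            // otherwise put the original match back unchanged.
            return strpos($m[2], $marker) !== false ? '' : $m[0];
        },
        $html
    );
}

$input = "<b>SomeText...rem</b>\n<u>Keep me</u>\n<a href=\"/\">SomeText...rem</a>";

// Only the <u> element (and the surrounding newlines) should remain.
echo removeMarkedElements($input);
```

One thing to keep in mind: the match starts at the outermost opening tag, so for something like <div><b>x...rem</b> keep</div> the whole div is removed, not just the inner <b>. If that matters, a real HTML parser is the safer route.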

Upvotes: 1
