Reputation: 101

NotePad++ Regular expression to remove HTML tag containing embedded tags

Using Notepad++, a department of technical writers needs to remove the `<span class="temp">...</span>` tags from texts like this:

`<span class="temp">See</span> Problems pane <span class="temp">for more <b>information</b>.</span>`

(Clarification:) The desired result is the inner text of the elements without the span tags. The output of the above example would be:

 `See Problems pane for more <b>information</b>.`

What I think I need is something like this:
Find: <span..>(capture anything except "</span>")</span>
Replace: \1

I cannot use ([^<])* as a capture group because of other tags in the span, like the `<b>` in the example.

I cannot use (.*) because there may be two such spans on a line.

I have tried matching the entire closing tag with non-greedy syntax and with the {1} counting syntax, using examples I found in other posts, but I can't get it to work.

I have found several posts on negated expressions, but can't get them to work on a negated HTML tag in the capture group. There is a post with my exact question, but in PHP rather than Notepad++.

I would appreciate any suggestions.

Upvotes: 3

Answers (4)

Jeremy Jones

Reputation: 5631

It seems like this would be a simpler solution:

</?span[^>]*>

Replaced with nothing.
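
For anyone who wants to check the pattern outside Notepad++, here is a minimal Python sketch. Using Python's `re` module as a stand-in for Notepad++'s regex engine is an assumption; the two may differ in details. Note that the pattern strips every `span` tag, not only those with `class="temp"`.

```python
import re

# Sample line from the question.
sample = ('<span class="temp">See</span> Problems pane '
          '<span class="temp">for more <b>information</b>.</span>')

# Matches any opening or closing span tag, whatever its attributes.
SPAN_TAG = re.compile(r'</?span[^>]*>')

# Replace every span tag with nothing; other tags such as <b> are untouched.
result = SPAN_TAG.sub('', sample)
print(result)
```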

Upvotes: 2

Luis Colorado

Reputation: 12668

In general, you cannot remove complete elements (meaning correctly paired tags) from an XML or HTML document with a regular expression, because neither XML nor HTML is a regular language (they are context-free). If you try, you run into this scenario:

<div something="bla bla">
   <someothertag> bla bla </someothertag>
   <div something="foo bar">  <!-- this tag will give you problems -->
         other text
   </div>  <!-- we have to match up to here? (wrong!) -->
</div>  <!-- or here? (right!) -->

Regular expressions cannot count the number of open tags, so they cannot find the correct matching closing tag. You need a context-free grammar parser for that. This is the reason some people here have recommended using an XML parser for the task. XML parsers are designed to parse and validate XML documents, which all share the same basic syntax (you don't need to validate to select the right part of the document). It's the recommended option: parse the document with an XML parser and then locate the exact element using an XPath library.
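
As a rough illustration of the parser route, here is a sketch using Python's standard `html.parser` module rather than an XML parser with XPath (my substitution, to stay within the standard library). The names `SpanUnwrapper` and `unwrap_spans` are made up for this example. A stack remembers, for each open `span`, whether it is being dropped, which is the kind of counting a regular expression cannot do.

```python
from html.parser import HTMLParser


class SpanUnwrapper(HTMLParser):
    """Hypothetical helper: drops <span> tags of one class, keeps their content."""

    def __init__(self, css_class):
        super().__init__(convert_charrefs=False)
        self.css_class = css_class
        self.out = []
        self.stack = []  # one bool per open <span>: True if that span is dropped

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            drop = self.css_class in classes
            self.stack.append(drop)
            if drop:
                return
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Pop the matching entry so nested spans pair up correctly.
        if tag == "span" and self.stack and self.stack.pop():
            return
        self.out.append(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def unwrap_spans(html, css_class="temp"):
    parser = SpanUnwrapper(css_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = ('<span class="temp">See</span> Problems pane '
          '<span class="temp">for more <b>information</b>.</span>')
print(unwrap_spans(sample))
```

This is only a sketch: end tags are re-emitted in lowercase, and doctype declarations and processing instructions are not passed through.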

On the other hand, if you only want to strip every tag from your HTML document, you can, because the grammar that defines a single tag is regular. You can search for this pattern:

<([^>"']|"[^"]*"|'[^']*')*>

and substitute it with nothing (take care to escape any special characters, as I don't know which ones are special in Notepad++).
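
A small Python sketch of this tag-stripping idea, with a `*` after the group so that it can consume a whole tag. Python's `re` is used here as an assumption; the sample string and variable names are invented for the illustration. It compares the quote-aware pattern with a naive `<[^>]*>` on a tag whose attribute value contains `>`.

```python
import re

# Quote-aware tag pattern: a tag body is any run of non-quote, non-">"
# characters, or complete double-quoted or single-quoted strings.
TAG = re.compile(r'''<([^>"']|"[^"]*"|'[^']*')*>''')

# Naive pattern that stops at the first ">" even inside a quoted value.
NAIVE_TAG = re.compile(r'<[^>]*>')

# Made-up sample: the title attribute contains a ">" character.
text = '<a title="x > y" href=\'p.html\'>link</a> and <b>bold</b>'

clean = TAG.sub('', text)
naive = NAIVE_TAG.sub('', text)

print(clean)
print(naive)
```

The naive pattern cuts the first tag short at the `>` inside the quoted attribute and leaves part of the tag behind.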

Edit

As suggested, in case you are completely sure no other tags are included inside the `<span>...</span>`, you can use this regexp:

<span[ \t]+([^>"']|"[^"]*"|'[^']*'|\n)*(\bclass="foo")([^>"']|"[^"]*"|'[^']*'|\n)*>([^<]*)<\/span>

and substitute it with

$4

as this demo shows.

If you want to eliminate the class discriminator, just use:

<span\b([^>"']|"[^"]*"|'[^']*'|\n)*>([^<]*)<\/span>

and substitute with

$2

as shown in this demo.
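
A quick Python check of this simpler pattern (Python writes the replacement as `\2` where Notepad++ uses `$2`; running it in Python is my assumption, not part of the original answer). On the sample from the question it unwraps only the first span, since `[^<]*` cannot cross the embedded `<b>` tag.

```python
import re

# Pattern without the class discriminator, copied from the answer.
SIMPLE_SPAN = re.compile(
    r'''<span\b([^>"']|"[^"]*"|'[^']*'|\n)*>([^<]*)<\/span>'''
)

# Sample line from the question.
sample = ('<span class="temp">See</span> Problems pane '
          '<span class="temp">for more <b>information</b>.</span>')

# Group 2 is the text between the opening and closing span tags.
result = SIMPLE_SPAN.sub(r'\2', sample)
print(result)
```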

Note 2

The reason for such complexity in the first group of parentheses is the possibility of `<` and `>` appearing inside quoted attribute values (some are forbidden by XML syntax and must be escaped as `&lt;` and `&gt;`, but not everybody follows this approach).

Note 3

After some testing, and seeing that your code allows other tags (not span tags) between span markers, I have changed my regex to:

<span\b([^>"']|"[^"]*"|'[^']*'|\n)*>(([^<]|<[^\/]|<\/[^s]|<\/s[^p]|<\/sp[^a]|<\/spa[^n]|<\/span[^ \t>])*)<\/span>

to allow anything between the `<span>` tags except another span tag. See demo. This time you have to select group 2 as well:

$2
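
The Note 3 pattern can be sanity-checked in Python. This assumes Python's `re` behaves like Notepad++ for this pattern, and the replacement is written `\2` instead of `$2`. The long alternation consumes any character or tag start that does not begin a closing `</span>`, so embedded tags such as `<b>` are kept.

```python
import re

# Note 3 pattern, copied from the answer: group 2 captures everything
# up to the next closing span tag, including other embedded tags.
SPAN_WITH_TAGS = re.compile(
    r'''<span\b([^>"']|"[^"]*"|'[^']*'|\n)*>'''
    r'''(([^<]|<[^\/]|<\/[^s]|<\/s[^p]|<\/sp[^a]|<\/spa[^n]|<\/span[^ \t>])*)'''
    r'''<\/span>'''
)

# Sample line from the question.
sample = ('<span class="temp">See</span> Problems pane '
          '<span class="temp">for more <b>information</b>.</span>')

result = SPAN_WITH_TAGS.sub(r'\2', sample)
print(result)
```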

Upvotes: 0

Steve

Reputation: 101

Adapting Luis Colorado's answer, this worked in my case:

(([^<]|<[^\/]|<\/[^s]|<\/s[^p]|<\/sp[^a]|<\/spa[^n]|<\/span[^ \t>])*)<\/span>

replaced with

$1

Thanks

Upvotes: 0

Pedro Lobito

Reputation: 98881

To remove ALL tags use:

FIND WHAT:

<.*?>|</.*?>

REPLACE WITH:

NOTHING

To remove SPECIFIC tags, use: