TheBoss

Reputation: 769

Regular Expression for Extracting Script Tags

I am trying to write a regular expression in C# to remove all script tags and anything contained within them.

So far I have come up with \<([^:]*?:)?script\>[^(\</<([^:]*?:)?script\>)]*?\</script\>, but it does not work.

I'll break it up and explain my thinking in each section:

\<([^:]*?:)?script\>

Here I am trying to state that it should get any script element, even if it is prefixed with a namespace, say, <a:script></a:script>. I have also added this to the closing tag.

[^(\</<([^:]*?:)?script\>)]*?

Here I am trying to state that it should allow anything to be contained within the tags except for </a:script>, </script>, etc.

\</script\>

Here I am stating that it should have a closing tag.

Can anyone spot where I am going wrong?

Upvotes: 8

Views: 12194

Answers (2)

Robert Koritnik

Reputation: 105029

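Your attempt fails because [^...] is a character class: it excludes individual characters, not the sequence </script>. The following regular expression does the trick, provided you pass RegexOptions.Singleline so that . also matches newlines. The prefix group excludes <, > and whitespace so that it cannot run across earlier tags:

\<(?:[^:<>\s]+:)?script\>.*?\<\/(?:[^:<>\s]+:)?script\>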
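As a rough sketch of how such a pattern might be applied in C#, the snippet below calls Regex.Replace with RegexOptions.Singleline, since script bodies usually span several lines. The class name ScriptStripper, the method RemoveScripts, and the sample input are made up for illustration. The pattern restricts the namespace prefix to characters other than :, <, > and whitespace, and IgnoreCase is an extra assumption so that <SCRIPT> is also caught.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptStripper
{
    // Optional namespace prefix (e.g. "a:"), then "script",
    // then lazily everything up to the first closing script tag.
    private const string Pattern =
        @"<(?:[^:<>\s]+:)?script>.*?</(?:[^:<>\s]+:)?script>";

    // Hypothetical helper: returns html with plain script elements removed.
    public static string RemoveScripts(string html)
    {
        return Regex.Replace(
            html,
            Pattern,
            string.Empty,
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string html =
            "<p>before</p><script>\nalert(1);\n</script>" +
            "<a:script>x()</a:script><p>after</p>";

        Console.WriteLine(RemoveScripts(html));
        // prints: <p>before</p><p>after</p>
    }
}
```

Note that this pattern only matches opening tags without attributes, so <script type="text/javascript"> would be left in place.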

But please don't do it.

You will run into a problem with HTML as simple as this:

<script>
var s = "<script></script>";
</script>
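To make the failure concrete, here is a small C# sketch that runs a pattern of this shape over the HTML above. The class name NestedScriptDemo and the method Strip are hypothetical, and the prefix group is written so that it cannot contain <, > or whitespace. Because the lazy .*? stops at the first closing tag, which is the one inside the string literal, the tail of the script is left behind.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedScriptDemo
{
    // Hypothetical helper: strips script elements with a regex.
    public static string Strip(string html)
    {
        return Regex.Replace(
            html,
            @"<(?:[^:<>\s]+:)?script>.*?</(?:[^:<>\s]+:)?script>",
            string.Empty,
            RegexOptions.Singleline);
    }

    public static void Main()
    {
        string html = "<script>\nvar s = \"<script></script>\";\n</script>";

        string result = Strip(html);

        // The match ends at the inner </script>, so the remainder
        // of the script survives: a quote, a semicolon, a newline,
        // and the real closing tag.
        Console.WriteLine(result == "\";\n</script>"); // prints: True
    }
}
```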

How are you going to solve this problem? It is smarter to use the HTML Agility Pack for such things.
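A minimal sketch of the HTML Agility Pack route might look like the following. It assumes the HtmlAgilityPack NuGet package is referenced. The class name HapScriptStripper, the method RemoveScripts, and the prefix check for names such as a:script are illustrative choices, not something the library prescribes.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // third-party package, assumed to be installed

public static class HapScriptStripper
{
    // Hypothetical helper: parses the markup and removes script nodes.
    public static string RemoveScripts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Materialize the list first so nodes are not removed
        // while the tree is still being enumerated.
        var scripts = doc.DocumentNode
            .Descendants()
            .Where(n => n.Name.Equals("script", StringComparison.OrdinalIgnoreCase)
                     || n.Name.EndsWith(":script", StringComparison.OrdinalIgnoreCase))
            .ToList();

        foreach (var node in scripts)
        {
            node.Remove();
        }

        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        string html = "<p>before</p><script>alert(1);</script><p>after</p>";
        Console.WriteLine(RemoveScripts(html));
    }
}
```

The parser, rather than a pattern, decides where each script element ends, and attributes on the opening tag need no special handling.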

Upvotes: 22
