Reputation: 769
I am trying to write a regular expression in C# to remove all script tags and anything contained within them.
So far I have come up with the following, however it does not work:
\<([^:]*?:)?script\>[^(\</<([^:]*?:)?script\>)]*?\</script\>
I'll break it up and explain my thinking in each section:
\<([^:]*?:)?script\>
Here I am trying to state that it should get any script element, even if it is prefixed with a namespace, say, <a:script></a:script>.
I have also added this to the closing tag.
[^(\</<([^:]*?:)?script\>)]*?
Here I am trying to state that it should allow anything to be contained within the tags except for a closing tag such as </a:script>, </script>, etc.
\</script\>
Here I am stating that it should have a closing tag.
Can anyone spot where I am going wrong?
Upvotes: 8
Views: 12194
Reputation: 54734
You can't parse HTML with regular expressions.
Use the HTML Agility Pack instead.
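As a rough sketch of what that could look like, assuming the HtmlAgilityPack NuGet package is referenced: the class and method names (ScriptRemover, RemoveScripts) are made up for illustration, and the check for a ":script" suffix assumes the parser keeps a namespace prefix as part of the node name, which is worth verifying against your own input.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of the library.
public static class ScriptRemover
{
    public static string RemoveScripts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Collect the nodes first so the tree is not modified while it is being enumerated.
        // Assumption: a prefixed element such as <a:script> is exposed with the name "a:script".
        var scripts = doc.DocumentNode
            .Descendants()
            .Where(n => n.Name.Equals("script", StringComparison.OrdinalIgnoreCase)
                     || n.Name.EndsWith(":script", StringComparison.OrdinalIgnoreCase))
            .ToList();

        foreach (var node in scripts)
        {
            node.Remove(); // removes the element together with everything inside it
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling ScriptRemover.RemoveScripts(html) returns the markup with the script elements taken out, without any pattern having to guess where a tag ends.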
Upvotes: 16
Reputation: 105029
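This regular expression does the trick for simple cases, provided you use RegexOptions.Singleline so that . also matches line breaks:
\<(?:\w+:)?script\>.*?\<\/(?:\w+:)?script\>
Note that the prefix is matched with \w+; a negated class such as [^:]+ could run across other tags up to the next colon.
However, you will run into a problem with HTML as simple as this: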
<script>
var s = "<script></script>";
</script>
How are you going to solve this problem? It is smarter to use the HTML Agility Pack for such things.
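To make the failure concrete, here is a minimal sketch of the regex approach in C#. It uses \w+ for the namespace prefix so the prefix cannot span other tags, and drops the backslashes before <, > and /, which .NET does not need. ScriptStripper and RemoveScripts are illustrative names only, and the pattern deliberately ignores tags with attributes such as <script type="text/javascript">.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class ScriptStripper
{
    // Singleline lets '.' match newlines; IgnoreCase also catches <SCRIPT>.
    private static readonly Regex ScriptBlock = new Regex(
        @"<(?:\w+:)?script>.*?</(?:\w+:)?script>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string RemoveScripts(string html)
    {
        return ScriptBlock.Replace(html, string.Empty);
    }
}

public static class Program
{
    public static void Main()
    {
        // Simple case: both the plain and the prefixed script element are removed.
        string simple = "<p>before</p><script>alert(1);</script><a:script>x</a:script><p>after</p>";
        Console.WriteLine(ScriptStripper.RemoveScripts(simple));
        // prints: <p>before</p><p>after</p>

        // Problem case: the lazy match stops at the </script> inside the string literal,
        // so the tail of the script is left behind in the output.
        string nested = "<script>\nvar s = \"<script></script>\";\n</script>";
        Console.WriteLine(ScriptStripper.RemoveScripts(nested) == "\";\n</script>");
        // prints: True
    }
}
```

The second call shows the leftover fragment the regex cannot avoid on its own, since it has no notion of where the script's string literal begins and ends.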
Upvotes: 22