Reputation: 21
I want to delete all <script> blocks
in all HTML files in all subfolders.
I can't work out the correct command-line version of this
regular expression: <script[\w\W]*?</script>
Here is how it looks in my command:
find . -type f -name "*.html" -exec sed -i 's/<script[\w\W]*?</script>//g' {} \;
I also tried every level of escaping, down to:
\<script\[\\w\\W\]\*\?\<\/script\>
but this doesn't work either.
There is another option:
find -type f -name \*.html | xargs sed -i '/\<script/,/\<\/script\>/c\ '
but it deletes all the contents of the page from the first script to the last.
All I need is to delete only the <script ....</script> blocks.
Maybe grep can do it?
Upvotes: 1
Views: 126
Reputation: 21
I found a simple solution:
find . -type f -name "*.html" -exec perl -0 -i -pe 's/<script.*?script>//gs' {} \;
Upvotes: 0
Reputation: 12438
Example of file:
$ more input.html
<!DOCTYPE html>
<html>
<head>
<title>Title of the document</title>
</head>
<body>
<p id="example"></p>
<script>
document.getElementById("example").innerHTML = "My first JavaScript code";
</script>
</body>
</html>
Example of stylesheet:
$ more removescript.xsl
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xhtml="http://www.w3.org/1999/xhtml">
<xsl:output method="html" encoding="utf-8" indent="yes"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()" />
</xsl:copy>
</xsl:template>
<xsl:template match="//script" />
</xsl:stylesheet>
Command:
$ xsltproc --html removescript.xsl input.html
<html>
<head>
<title>Title of the document</title>
</head>
<body>
<p id="example"/>
</body>
</html>
Explanations:
The stylesheet copies every node and attribute; when it matches a <script> </script> node
it does nothing (no copy), so those nodes are removed from the result.
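To cover the original goal (all HTML files in all subfolders), the stylesheet can be applied file by file; a sketch, assuming xsltproc is installed and removescript.xsl is the stylesheet above (the .tmp suffix is arbitrary):

```shell
# Run removescript.xsl over every .html file under the current
# directory and write each result back in place.
find . -type f -name '*.html' -print0 |
while IFS= read -r -d '' f; do
    xsltproc --html removescript.xsl "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done
```

The -print0 / read -d '' pairing keeps file names with spaces intact; writing to a temporary file first avoids truncating the input before xsltproc has read it.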
Upvotes: 2
Reputation: 26471
Using regex to parse HTML or XML files is generally discouraged (see here and here). Tools such as sed
and awk
are extremely powerful for handling text files, but when it boils down to parsing data with complex structure, such as XML, HTML or JSON, they are nothing more than a sledgehammer. Yes, you can get the job done, but sometimes at a tremendous cost. Handling such delicate files needs a bit more finesse, using a more targeted set of tools.
For parsing XML or HTML, one can easily use xmlstarlet:
xmlstarlet ed -d '//script'
However, as HTML pages are often not well-formed XML, it can be handy to clean them up a bit first using tidy. For the example case above this gives:
$ tidy -q -numeric -asxhtml --show-warnings no <file.html> \
| xmlstarlet ed -N "x=http://www.w3.org/1999/xhtml" \
  -d '//x:script'
where -N binds the prefix x to the XHTML namespace, which is recognised from
<html xmlns="http://www.w3.org/1999/xhtml">
in the XHTML output of tidy.
Upvotes: 2