Canapsis

Reputation: 21

How to delete content between text?

I want to delete every <script> element from all HTML files in all subfolders, but I can't find the correct form of the command line.

The regular expression is: <script[\w\W]*?</script>

Here is how I use it on the command line:

find . -type f -name "*.html" -exec sed -i 's/<script[\w\W]*?</script>//g' {} \;

I also tried escaping every special character, down to: \<script\[\\w\\W\]\*\?\<\/script\>

This doesn't work either.

There is another option:

find -type f -name \*.html | xargs sed -i '/\<script/,/\<\/script\>/c\ '

but it deletes the whole content of the page from the first script to the last. I only need to delete each <script ...>...</script> block.

Maybe grep can do it?

Upvotes: 1

Views: 126

Answers (3)

Canapsis

Reputation: 21

I found a simple solution:

find . -type f -name "*.html" -exec perl -0 -i -pe 's/<script.*?script>//gs' {} \;

Upvotes: 0

Allan

Reputation: 12438

Example of file:

$ more input.html 
<!DOCTYPE html>
<html>
  <head>
    <title>Title of the document</title>
  </head>
  <body>
    <p id="example"></p>
    <script>
      document.getElementById("example").innerHTML = "My first JavaScript code";
    </script>
  </body>
</html>

Example of stylesheet:

$ more removescript.xsl 
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xhtml="http://www.w3.org/1999/xhtml">

    <xsl:output method="html" encoding="utf-8" indent="yes"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" />
        </xsl:copy>
    </xsl:template>

    <xsl:template match="//script" />

</xsl:stylesheet>

Command:

$ xsltproc --html removescript.xsl input.html 
<html>
  <head>
    <title>Title of the document</title>
  </head>
  <body>
    <p id="example"/>

  </body>
</html>

Explanation:

The stylesheet copies every node and attribute through the identity template. The empty template that matches script elements produces no output, so those nodes are left out of the result.
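Since the question asks about all HTML files in all subfolders, here is a rough sketch of how the command could be applied to a whole tree. The function name strip_scripts_tree and its two arguments are my own invention, not part of xsltproc. It writes to a temporary file first and only replaces the original if xsltproc succeeded.

```shell
# strip_scripts_tree is a hypothetical helper name, not an xsltproc feature.
# Usage: strip_scripts_tree path/to/removescript.xsl path/to/site
strip_scripts_tree() {
  xsl=$1    # stylesheet that drops <script> nodes
  root=$2   # directory to process recursively
  find "$root" -type f -name '*.html' -exec sh -c '
    xsl=$1; shift
    for f in "$@"; do
      # Never let xsltproc write onto its own input: go through a temp file.
      if xsltproc --html -o "$f.tmp" "$xsl" "$f"; then
        mv "$f.tmp" "$f"
      else
        rm -f "$f.tmp"
        echo "skipped: $f" >&2
      fi
    done
  ' sh "$xsl" {} +
}
```

Called as strip_scripts_tree removescript.xsl . it would rewrite every .html file below the current directory, so it is worth trying on a copy first. Keep in mind that the files are re-serialised by libxslt, so whitespace and some markup details may change.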

Upvotes: 2

kvantour

Reputation: 26471

Parsing HTML or XML with regular expressions is generally discouraged. Tools such as sed and awk are extremely powerful for handling text files, but for complex structured data such as XML, HTML or JSON they are nothing more than a sledgehammer. Yes, you can get the job done, but sometimes at a tremendous cost. Such delicate files call for more finesse and a more targeted set of tools.

In case of parsing XML or HTML, one can easily use xmlstarlet.

xmlstarlet ed -d '//script'

However, as HTML pages are often not well-formed XML, it can help to clean them up first with tidy. For the example above this gives:

$ tidy -q -numeric -asxhtml --show-warnings no file.html \
  | xmlstarlet ed -N "x=http://www.w3.org/1999/xhtml" \
               -d '//x:script'

Here -N binds the prefix x to the XHTML namespace. The XPath expression has to use that prefix, because tidy puts every element of its XHTML output in that namespace. You can recognise the namespace by the line

<html xmlns="http://www.w3.org/1999/xhtml">

in the XHTML output of tidy.
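To cover all files in all subfolders, the pipeline could be wrapped roughly as follows. This is only a sketch: strip_scripts_xhtml is a made-up helper name, the XPath uses the x: prefix bound with -N, and every processed file ends up converted to XHTML by tidy, which may or may not be what you want.

```shell
# strip_scripts_xhtml is a hypothetical helper; it rewrites each file as XHTML.
# Usage: strip_scripts_xhtml path/to/site
strip_scripts_xhtml() {
  root=$1   # directory to process recursively
  find "$root" -type f -name '*.html' -exec sh -c '
    for f in "$@"; do
      # tidy exits non-zero even for mere warnings, so success is judged by
      # the end of the pipeline (xmlstarlet) and by a non-empty result.
      if tidy -q -numeric -asxhtml --show-warnings no "$f" 2>/dev/null \
           | xmlstarlet ed -N "x=http://www.w3.org/1999/xhtml" \
                        -d "//x:script" > "$f.tmp" \
         && [ -s "$f.tmp" ]; then
        mv "$f.tmp" "$f"
      else
        rm -f "$f.tmp"
        echo "skipped: $f" >&2
      fi
    done
  ' sh {} +
}
```

Files that tidy cannot repair produce no output, fail the non-empty check, and are left untouched with a "skipped" message on stderr.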

Upvotes: 2
