David Metcalfe

Reputation: 2421

Remove multiline HTML in Python

I'm trying to strip particular chunks out of HTML documents, specifically JavaScript (<script></script>) and inline CSS (<style></style>). I'm currently using re.sub(), but I'm not having any luck with the MULTILINE flag. Any tips?

import re

s = '''<html>
<head>
  <title>Some Template</title>
  <script type="text/javascript" src="{path to Library}/base.js"></script>
  <script type="text/javascript" src="something.js"></script>
  <script type="text/javascript" src="simple.js"></script>
</head>
<body>
  <script type="text/javascript">
    // HelloWorld template
    document.write(examples.simple.helloWorld());
  </script>
</body>
</html>'''

print(re.sub('<script.*script>', '', s, count=0, flags=re.M))

Upvotes: 1

Views: 287

Answers (2)

JRodDynamite

Reputation: 12623

Alternatively, since you are parsing and modifying HTML, I'd suggest using an HTML parser such as BeautifulSoup.

If you simply want to remove all the script tags from the HTML tree, you can use .decompose() or .extract().

.extract() returns the tag that was removed, whereas .decompose() simply destroys it.

from bs4 import BeautifulSoup

soup = BeautifulSoup(s, "html.parser")
for i in soup('script'):
    i.decompose()

print(soup)
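To make the difference between the two methods concrete, here is a minimal sketch that also covers the <style> tags from the question. The sample document and the variable names (html, removed, cleaned) are made up for illustration, and it assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input, not the document from the question.
html = """<html><head><style>p { color: red; }</style>
<script src="a.js"></script></head>
<body><p>Hello</p><script>
document.write("x");
</script></body></html>"""

soup = BeautifulSoup(html, "html.parser")

# extract() detaches each tag from the tree and returns it,
# so the removed <style> elements can still be inspected afterwards.
removed = [tag.extract() for tag in soup.find_all("style")]

# decompose() detaches each tag and destroys it; nothing is returned.
for tag in soup.find_all("script"):
    tag.decompose()

cleaned = str(soup)
print(cleaned)
print(removed[0].name)
```

Use extract() when you still need the removed markup (for logging, say), and decompose() when you just want it gone.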

As discussed in the comments, you can make additional modifications to the HTML tree. Refer to the docs for more information.

Upvotes: 2

Avinash Raj

Reputation: 174786

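You actually need the DOTALL modifier, not MULTILINE. MULTILINE only changes where ^ and $ match; DOTALL makes . match newline characters as well, which is what lets the pattern span several lines.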

print(re.sub(r'(?s)<script\b.*?</script>', '', s))

This version also removes the whitespace (including the preceding newline) that comes before each script tag:

print(re.sub(r'(?s)\s*<script\b.*?</script>', '', s))
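As a rough illustration of why the flag matters, the sketch below runs the same pattern with re.M and with re.S, then extends it to cover <style> as well, since the question asks about both. The sample html string and the TAG_BLOCK name are invented for this example, and the combined pattern is my own extension of the approach, not something guaranteed to handle every page.

```python
import re

# Hypothetical sample input, not the document from the question.
html = """<head>
  <style>
    p { color: red; }
  </style>
  <script src="a.js"></script>
</head>
<body>
  <p>Hello</p>
  <script>
    document.write("x");
  </script>
</body>"""

# re.M only changes ^ and $; "." still stops at newlines, so only the
# one-line <script> is removed and the multi-line block survives.
with_multiline = re.sub(r"<script\b.*?</script>", "", html, flags=re.M)

# re.S (DOTALL) lets "." cross newlines, so both script blocks are removed.
with_dotall = re.sub(r"<script\b.*?</script>", "", html, flags=re.S)

# Assumed extension: one pattern for both tags. \1 repeats whichever tag
# name was opened, and re.I tolerates upper-case tag names.
TAG_BLOCK = re.compile(r"\s*<(script|style)\b.*?</\1\s*>", re.S | re.I)
cleaned = TAG_BLOCK.sub("", html)

print(cleaned)
```

Keep in mind that a regex like this can still be fooled, for example by a literal "</script>" inside a JavaScript string, so it is best kept for input you control.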

Upvotes: 1
