Reputation: 2421
I'm trying to strip particular chunks out of HTML documents, specifically JavaScript (<script></script>) and inline CSS (<style></style>). Currently I'm trying to use re.sub() but am not having any luck with Multiline. Any tips?
import re
s = '''<html>
<head>
<title>Some Template</title>
<script type="text/javascript" src="{path to Library}/base.js"></script>
<script type="text/javascript" src="something.js"></script>
<script type="text/javascript" src="simple.js"></script>
</head>
<body>
<script type="text/javascript">
// HelloWorld template
document.write(examples.simple.helloWorld());
</script>
</body>
</html>'''
print(re.sub('<script.*script>', '', s, count=0, flags=re.M))
Upvotes: 1
Views: 287
Reputation: 12623
Alternatively, since you are parsing and modifying HTML, I'd suggest using an HTML parser like BeautifulSoup.
If you simply want to remove all the script tags within the HTML tree, you can use .decompose() or .extract().
.extract() returns the tag that was extracted, whereas .decompose() simply destroys it.
from bs4 import BeautifulSoup
soup = BeautifulSoup(s, "html.parser")
for i in soup('script'):
    i.decompose()
print(soup)
As discussed in the comments, you can make additional modifications to the HTML tree. Refer to the docs for more info.
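To illustrate the difference between .extract() and .decompose() described above, here is a minimal sketch. The sample document and the variable names (html, removed_style, cleaned) are made up for illustration and are not from the question; it also handles the style tag, which the question mentions but the snippet above does not.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, not the one from the question.
html = """<html><head>
<style>body { color: red; }</style>
<script src="a.js"></script>
</head><body><p>Hello</p>
<script>document.write("x");</script>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# extract() detaches the tag from the tree and returns it, so it can be kept.
removed_style = soup.style.extract()

# decompose() removes each tag from the tree and destroys it.
for tag in soup.find_all("script"):
    tag.decompose()

cleaned = str(soup)
print(removed_style)  # the detached <style> tag is still usable
print(cleaned)        # the remaining document, without script or style
```

Use .extract() when you still need the removed content (for example, to log it or move it elsewhere), and .decompose() when you just want it gone.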
Upvotes: 2
Reputation: 174786
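You actually need the DOTALL modifier, not MULTILINE. MULTILINE only changes where ^ and $ match; DOTALL is what lets . match newlines, so the pattern can span a script block that covers several lines.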
print(re.sub(r'(?s)<script\b.*?</script>', '', s))
The following also removes any whitespace that precedes each script tag:
print(re.sub(r'(?s)\s*<script\b.*?</script>', '', s))
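As a rough sketch of why the flag matters, the snippet below runs the same pattern with re.M and with re.S on a small made-up sample (not the document from the question). The last call is an assumed extension that also strips style blocks, since the question mentions them; the \1 backreference makes the closing tag match whichever opening tag was found.

```python
import re

# Hypothetical sample, shorter than the document in the question.
sample = """<head>
<style type="text/css">
body { margin: 0; }
</style>
<script type="text/javascript">
document.write("hi");
</script>
</head>"""

pattern = r'<script\b.*?</script>'

# With re.M, '.' still stops at newlines, so the multi-line block is untouched.
with_multiline = re.sub(pattern, '', sample, flags=re.M)

# With re.S (DOTALL), '.' also matches newlines, so the block is removed.
with_dotall = re.sub(pattern, '', sample, flags=re.S)

# Assumed extension: remove <style> blocks too, ignoring tag case.
both = re.sub(r'\s*<(script|style)\b.*?</\1>', '', sample, flags=re.S | re.I)

print(with_multiline == sample)
print('<script' in with_dotall)
print(both)
```

Keep in mind that a regex like this only handles straightforward markup; for arbitrary HTML, a parser is the more robust choice.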
Upvotes: 1