Reputation: 4181
I want to remove all the JavaScript code from an HTML document and leave only the actual text. Is there a regex or Python script that does this? Thanks.
Upvotes: 4
Views: 3159
Reputation: 3005
You can write a regex that looks for '<script' and 'script>' and removes everything between them.
Edit: As @cHao points out, regexes are bad for parsing HTML.
A regex might still be useful in places where you have full control over the HTML.
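For the controlled-HTML case, here is a minimal Python sketch of that idea. The function name strip_script_blocks and the exact pattern are my own assumptions, not something from a library. It only removes <script>...</script> blocks. It will miss inline handlers such as onclick attributes and javascript: URLs, and it can break if the script body itself contains the text "</script>".

```python
import re

# Hypothetical pattern: an opening <script ...> tag, the shortest possible
# body, then the closing tag. DOTALL lets "." span newlines; IGNORECASE
# also matches <SCRIPT>.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>",
                       re.IGNORECASE | re.DOTALL)


def strip_script_blocks(html):
    """Return html with every <script>...</script> block removed."""
    return SCRIPT_RE.sub("", html)


if __name__ == "__main__":
    sample = '<p>a</p><SCRIPT type="text/javascript">\nalert(1);\n</script><p>b</p>'
    print(strip_script_blocks(sample))
```

The non-greedy .*? matters here: a greedy match would swallow everything between the first opening tag and the last closing tag, including any markup in between.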
Upvotes: 1
Reputation: 2116
You can use this jQuery code to remove the script elements:
$('script').remove()
and Firebug to inject your jQuery code into the webpage:
>>> var x = window.open("");
Window opened
>>> x
Window about:blank
>>> x.document
Document about:blank
>>> x.document.write("$('script').remove()");
Alert popped up
Upvotes: 0
Reputation: 129011
Using BeautifulSoup:
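#!/usr/bin/env python
# BeautifulSoup 3 import; with BeautifulSoup 4, use `from bs4 import BeautifulSoup`.
from BeautifulSoup import BeautifulSoup

# Parse the original document.
with open("with-scripts.html", "r") as f:
    soup = BeautifulSoup(f.read())

# Detach every <script> element from the tree.
for script in soup("script"):
    script.extract()

# Write out the cleaned document.
with open("without-scripts.html", "w") as f:
    f.write(soup.prettify())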
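If installing BeautifulSoup is not an option, roughly the same thing can be sketched with the standard library's html.parser. This also covers the "leave the actual text" part of the question by keeping only the text nodes. The names TextWithoutScripts and visible_text are hypothetical, and this is a sketch for Python 3, not a hardened HTML sanitizer.

```python
from html.parser import HTMLParser


class TextWithoutScripts(HTMLParser):
    """Collects text nodes, skipping anything inside <script> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._script_depth = 0  # greater than 0 while inside a <script>
        self.chunks = []        # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling the handlers.
        if tag == "script":
            self._script_depth += 1

    def handle_endtag(self, tag):
        if tag == "script" and self._script_depth > 0:
            self._script_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a script element.
        if self._script_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the document's text, one fragment per line, without scripts."""
    parser = TextWithoutScripts()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    page = ('<html><body><p>Hello</p>'
            '<script>var a = "<b>x</b>";</script>'
            '<p>World</p></body></html>')
    print(visible_text(page))
```

Text inside <style> elements would still come through with this sketch; the same depth counter could be extended to skip those as well.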
Upvotes: 5