Remove multiline HTML in Python

Question

I'm trying to strip out particular chunks of HTML documents, specifically JavaScript (<script>) and inline CSS (<style>). Currently I'm trying to use re.sub() but am not having any luck with multiline matches. Any tips?

import re

s = '''

  Some Template
  
  
  


  

'''

print(re.sub('', '', s, count=0, flags=re.M))

JRodDynamite · Accepted Answer

Alternatively, since you are parsing and modifying HTML, I'd suggest using an HTML parser such as BeautifulSoup.

If you simply want to remove all the script tags from the HTML tree, you can use .decompose() or .extract().

.extract() returns the tag that was removed, whereas .decompose() simply destroys it.
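To make the difference concrete, here is a minimal sketch. The `html` string is a made-up sample, not the document from the question, and it assumes the `bs4` package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, for illustration only.
html = "<div><script>var a = 1;</script><style>p { color: red; }</style><p>Hello</p></div>"
soup = BeautifulSoup(html, "html.parser")

# extract() detaches the tag from the tree and returns it,
# so the removed tag can still be inspected or reused.
removed = soup.script.extract()
print(removed.name)
print(removed.string)

# decompose() removes the tag and destroys it; it returns None.
result = soup.style.decompose()
print(result)

# Neither tag is left in the tree.
print(soup)
```

Use .extract() when you need the removed markup afterwards (for logging, say), and .decompose() when you only want it gone.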

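from bs4 import BeautifulSoup

# s is the HTML string from the question
soup = BeautifulSoup(s, "html.parser")

# soup('script') is shorthand for soup.find_all('script')
for tag in soup('script'):
    tag.decompose()  # remove the tag and everything inside it

print(soup)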
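Since the question asks about inline CSS as well, the same loop can cover both tag names. The following is a self-contained sketch; the `sample` markup is a hypothetical stand-in for the question's document, chosen only to show that tags spanning several lines need no special handling.

```python
from bs4 import BeautifulSoup

# Hypothetical multi-line input, for illustration only.
sample = """<html>
<head>
<title>Some Template</title>
<style>
body { margin: 0; }
</style>
<script>
console.log("hi");
</script>
</head>
<body><p>Content</p></body>
</html>"""

soup = BeautifulSoup(sample, "html.parser")

# Passing a list to find_all() matches any of the listed tag names.
for tag in soup.find_all(["script", "style"]):
    tag.decompose()

cleaned = str(soup)
print(cleaned)
```

Because the parser works on the element tree rather than on lines of text, newlines inside the script or style body make no difference here.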

As discussed in the comments, you can make additional modifications to the HTML tree. Refer to the BeautifulSoup docs for more info.
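If you would rather stay with re.sub() as in the question, note that re.M (re.MULTILINE) only changes where ^ and $ match. It does not let "." match a newline; that is what re.DOTALL (re.S) does. A rough sketch, with a hypothetical `sample` string and a pattern that is my own assumption rather than the one from the question:

```python
import re

# Hypothetical sample input, for illustration only.
sample = """<head>
<script type="text/javascript">
var x = 1;
</script>
<style>
p { color: red; }
</style>
</head>"""

# re.DOTALL lets "." match newlines, so ".*?" can span several lines.
# The backreference \1 requires the closing tag name to match the opening one.
pattern = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>",
    flags=re.DOTALL | re.IGNORECASE,
)

cleaned = pattern.sub("", sample)
print(cleaned)
```

This is fragile compared with a parser: a script whose body contains the text "</script>" inside a string, for example, will be cut short. Treat it as a quick fix for markup you control.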
