Marius Johan

Reputation: 392

How to remove HTML comments using Beautiful Soup

I'm cleaning text from a crawled website and I don't want any HTML comments in my data. Do I have to parse them out myself, or is there an existing function that does this?

I've tried doing this:

from bs4 import BeautifulSoup as S
soup = S("<!-- t --> <h1>Hejsa</h1> <style>html{color: #0000ff}</style>")
soup.comment # == None
soup.style   # == <style>html{color: #0000ff}</style>

Upvotes: 1

Views: 713

Answers (2)

the_train

Reputation: 71

Use regex.

import re
html = "<!-- t --> <h1>Hejsa</h1> <style>html{color: #0000ff}</style>"
# Non-greedy *? stops at the first "-->", so text between two comments is kept
html = re.sub(r'<!--[\s\S]*?-->', '', html).strip()
print(html)

Result:

<h1>Hejsa</h1> <style>html{color: #0000ff}</style>
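
One thing to watch with the regex approach is greediness. A greedy pattern such as `<!--[\s\S]*-->` matches from the first `<!--` to the last `-->`, so on a page with several comments it also deletes everything between them. Below is a minimal sketch of a non-greedy variant; the two-comment sample string and the helper name `strip_comments` are made up for illustration and are not part of the original answer.

```python
import re

# Hypothetical sample input: two comments with real content between them.
html = "<!-- a --> <h1>Hejsa</h1> <!-- b --> <p>keep me</p>"

# Non-greedy `.*?` stops at the first `-->`; re.DOTALL lets `.` span newlines.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(markup: str) -> str:
    """Remove every <!-- ... --> span from the markup string."""
    return COMMENT_RE.sub("", markup).strip()


# For comparison: the greedy form swallows the <h1> between the two comments.
greedy_result = re.sub(r"<!--[\s\S]*-->", "", html).strip()

print(strip_comments(html))
print(greedy_result)
```

A regex like this treats the markup as plain text, so it would also remove comment-like sequences inside `<script>` or `<style>` blocks; a parser-based approach avoids that.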

Upvotes: 1

Andrej Kesely

Reputation: 195418

To search for HTML comments, you can use the bs4.Comment type:

from bs4 import BeautifulSoup, Comment

html_doc = '''
    <!-- t --> <h1>Hejsa</h1> <style>html{color: #0000ff}</style>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

# print comment:
# `string=` is the current name of this argument; `text=` is the older alias
comment = soup.find(string=lambda t: isinstance(t, Comment))
print( comment )

Prints:

t

To extract it:

comment = soup.find(string=lambda t: isinstance(t, Comment))

# extract comment:
comment.extract()
print(soup.prettify())

Prints:

<h1>
 Hejsa
</h1>
<style>
 html{color: #0000ff}
</style>
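
Note that `find` returns only the first match, so the code above removes a single comment. To remove every comment in a document, one option is to loop over `find_all`, as in the sketch below. The two-comment sample string is an assumption made up for illustration, and the `string=` keyword assumes Beautiful Soup 4.4.0 or later (older releases use `text=`).

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample input with more than one comment.
html_doc = "<!-- a --> <h1>Hejsa</h1> <!-- b --> <style>html{color: #0000ff}</style>"

soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns a list of all matching nodes, so it is safe to
# detach each comment from the tree while iterating over the result.
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    comment.extract()

cleaned = str(soup).strip()
print(cleaned)
```

Because the comments are identified by the parser rather than by pattern matching, text inside `<style>` or `<script>` elements is left untouched.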

Upvotes: 1
