How to remove html comments using Beautiful Soup

Question

I'm cleaning text from a crawled website and don't want any HTML comments in my data. Do I have to parse them out myself, or is there an existing function for this?

I've tried doing this:

from bs4 import BeautifulSoup as S
soup = S(" Hejsa
 ")
soup.comment # == None
soup.style   # ==

Andrej Kesely · Accepted Answer

To search for HTML comments, you can use the bs4.Comment type:

from bs4 import BeautifulSoup, Comment

html_doc = '''
     Hejsa
 
'''

soup = BeautifulSoup(html_doc, 'html.parser')

# find the first comment node and print it:
comment = soup.find(text=lambda t: isinstance(t, Comment))
print(comment)

Prints:

To extract it:

comment = soup.find(text=lambda t: isinstance(t, Comment))

# extract comment:
comment.extract()
print(soup.prettify())

Prints:


 Hejsa
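
Note that find() only returns the first comment. If a page contains several, a minimal sketch along the same lines could loop over find_all() instead. The strip_comments helper and the sample markup below are made up for illustration and are not part of the original answer:

```python
from bs4 import BeautifulSoup, Comment


def strip_comments(html: str) -> str:
    """Return the markup with every HTML comment node removed (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() builds the full list of comment nodes first,
    # so extracting them while looping is safe.
    for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
        comment.extract()
    return str(soup)


# Illustrative input only, not the markup from the question.
sample = "<!-- first --><p>Hejsa<!-- second --></p>"
cleaned = strip_comments(sample)
print(cleaned)
```

This uses string=, the newer name for the text= argument shown above. On older bs4 releases text= should behave the same way.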
