Reputation: 392
I'm cleaning text from a crawled website, but I don't want any HTML comments in my data. Do I have to parse them out myself, or is there an existing function to do so?
I've tried doing this:
from bs4 import BeautifulSoup as S
soup = S("<!-- t --> <h1>Hejsa</h1> <style>html{color: #0000ff}</style>", "html.parser")
soup.comment # == None
soup.style # == <style>html{color: #0000ff}</style>
Upvotes: 1
Views: 713
Reputation: 71
Use a regex:
import re
html = "<!-- t --> <h1>Hejsa</h1> <style>html{color: #0000ff}</style>"
# non-greedy *? so each comment is matched on its own
html = re.sub(r'<!--[\s\S]*?-->', '', html).strip()
print(html)
Result:
<h1>Hejsa</h1> <style>html{color: #0000ff}</style>
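The non-greedy *? matters as soon as the page has more than one comment; a greedy [\s\S]* would also delete everything between the first and the last comment. A minimal sketch of that case (this html string is made up for illustration):
import re

# two comments: a greedy pattern would also remove the <h1> between them
html = "<!-- a --> <h1>Hejsa</h1> <!-- b --> <p>Tekst</p>"
print(re.sub(r'<!--[\s\S]*?-->', '', html).strip())
# -> <h1>Hejsa</h1>  <p>Tekst</p>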
Upvotes: 1
Reputation: 195418
To search for HTML comments, you can use the bs4.Comment type:
from bs4 import BeautifulSoup, Comment
html_doc = '''
<!-- t --> <h1>Hejsa</h1> <style>html{color: #0000ff}</style>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
# print comment:
comment = soup.find(string=lambda t: isinstance(t, Comment))
print(comment)
Prints:
t
To extract it:
comment = soup.find(string=lambda t: isinstance(t, Comment))
# extract comment:
comment.extract()
print(soup.prettify())
Prints:
<h1>
Hejsa
</h1>
<style>
html{color: #0000ff}
</style>
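A crawled page will usually contain more than one comment; the same idea extends to stripping all of them in one pass. A short sketch, reusing the soup object from above:
# remove every comment node, not just the first one:
for c in soup.find_all(string=lambda t: isinstance(t, Comment)):
    c.extract()
print(soup.prettify())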
Upvotes: 1