Jon S

Reputation: 11

Regexp, Python and doc comments <!-- text -->

I am programming in Python 2.7 and using beautifulsoup4 to extract information from the tags of a series of documents. However, the documents also contain strings such as:

<!-- PJG ITAG l=90 g=1 f=4 -->

I want to get rid of them, but I am not an expert on regexps. Can someone help with this, please?

Upvotes: 1

Views: 76

Answers (1)

Thomas Orozco

Reputation: 55233

Start by loading your HTML in BeautifulSoup:

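from bs4 import BeautifulSoup, Comment

# the_html holds the document's markup as a string; naming the parser keeps results consistent
soup = BeautifulSoup(the_html, "html.parser")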

Then, remove all the comments:

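# Comment is a NavigableString subclass, so isinstance selects only comment nodes
comments = soup.find_all(text=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract()  # detach the comment from the tree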
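
Since the question mentions regexps, here is a minimal sketch of a regex-only alternative, not the approach used above. It assumes the comments are well formed and that no "<!--" appears inside scripts or attribute values; the names strip_comments and COMMENT_RE are hypothetical, chosen only for illustration. The parser-based removal is generally the more robust choice.

```python
import re

# Non-greedy match from "<!--" to the nearest "-->".
# re.DOTALL lets "." match newlines, so multi-line comments are covered too.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(html):
    """Return html with every <!-- ... --> comment removed."""
    return COMMENT_RE.sub("", html)


# The comment from the question, placed between two illustrative tags
sample = "<p>before</p><!-- PJG ITAG l=90 g=1 f=4 --><p>after</p>"
cleaned = strip_comments(sample)
print(cleaned)
```

This works on the raw string before it is handed to BeautifulSoup, so it can be used as a preprocessing step if the comments are not needed at all.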

Upvotes: 3
