Reputation: 65550
How do I extract all HTML-style comments from a document, using Python?
I've tried using a regex:
text = 'hello, world <!-- comment -->'
re.match('<!--(.*?)-->', text)
But it produces nothing. I don't understand this since the same regex works fine on the same string at https://regex101.com/
UPDATE: My document is actually an XML file, and I'm parsing the document with pyquery (based on lxml), but I don't think lxml can extract comments that aren't inside a node. This is what the document looks like:
<?xml version="1.0" encoding="UTF-8"?>
<clinical_study rank="220398">
<intervention_browse>
<!-- CAUTION: The following MeSH terms are assigned with an imperfect algorithm -->
<mesh_term>Freund's Adjuvant</mesh_term>
<mesh_term>Keyhole-limpet hemocyanin</mesh_term>
</intervention_browse>
<!-- Results have not yet been posted for this study -->
</clinical_study>
UPDATE 2: Thanks for suggesting the other answer, but I'm already parsing the document extensively with lxml and don't want to rewrite everything with BeautifulSoup. Have updated title accordingly.
Upvotes: 1
Views: 2896
Reputation: 1451
XPath works just fine here: tree.xpath('//comment()'). For example, to remove all scripts, styles, and comments from the DOM you could do:
import lxml.html
tree = lxml.html.fromstring(html)
for el in tree.xpath('//script | //style | //comment()'):
    # getparent is a method and must be called; note that remove() also drops the element's tail text
    el.getparent().remove(el)
No BeautifulSoup.
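As a minimal sketch of how this applies to the XML from the question (shortened here, and assuming lxml is installed), the same XPath expression collects comments at any depth. The variable names are illustrative only:

```python
from lxml import etree

# Bytes input: lxml rejects str input that carries an encoding declaration.
xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<clinical_study rank="220398">
  <intervention_browse>
    <!-- CAUTION: The following MeSH terms are assigned with an imperfect algorithm -->
    <mesh_term>Freund's Adjuvant</mesh_term>
  </intervention_browse>
  <!-- Results have not yet been posted for this study -->
</clinical_study>"""

root = etree.fromstring(xml)

# '//comment()' selects comment nodes anywhere in the tree, in document order.
comments = [c.text.strip() for c in root.xpath('//comment()')]

for c in comments:
    print(c)
```

Each comment node exposes its content through .text, so no regex is needed once the document is parsed.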
Upvotes: 1
Reputation: 53623
This seems to print the comment for me:
from lxml import etree
# Bytes literal: under Python 3, lxml rejects str input that carries an encoding declaration
txt = b"""<?xml version="1.0" encoding="UTF-8"?>
<clinical_study rank="220398">
<intervention_browse>
<!-- CAUTION: The following MeSH terms are assigned with an imperfect algorithm -->
<mesh_term>Freund's Adjuvant</mesh_term>
<mesh_term>Keyhole-limpet hemocyanin</mesh_term>
</intervention_browse>
<!-- Results have not yet been posted for this study -->
</clinical_study>"""
root = etree.XML(txt)
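# First child of <intervention_browse>, which is the CAUTION comment
print(root[0][0])
To get the last comment directly under the root element:
# Keep only comment nodes among the root's direct children
comments = [itm for itm in root if itm.tag is etree.Comment]
if comments:
    print(comments[-1])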
Upvotes: 2
Reputation: 445
You have to use the re.findall() method to extract all substrings that match a certain pattern.
re.match() only checks whether the pattern fits at the beginning of the string, which is why your attempt returns nothing, while re.search() only gets you the first match within the string. For your purpose, re.findall() is the right method.
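A short sketch of the difference, using the pattern from the question. The sample string with two comments and the re.DOTALL flag are my additions, the latter on the assumption that comments may span several lines:

```python
import re

pattern = r'<!--(.*?)-->'

# re.match anchors at position 0, so it finds nothing in the question's string.
assert re.match(pattern, 'hello, world <!-- comment -->') is None

text = 'hello <!-- first --> world <!-- second\nline -->'

# re.DOTALL lets "." match newlines, so multi-line comments are captured too.
comments = re.findall(pattern, text, flags=re.DOTALL)
print(comments)
```

Because the pattern has one capture group, re.findall() returns only the text between the delimiters rather than the whole match.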
Upvotes: 1
Reputation: 1000
Change match to search, and then:
import re
text = 'hello, world <!-- comment -->'
comment = re.search('<!--(.*?)-->', text)
comment.group(1)
Output:
' comment '
Upvotes: 1
Reputation: 183
You could use Beautiful Soup to extract the comments in a for loop like this:
from bs4 import BeautifulSoup, Comment
text = 'hello, world <!-- comment -->'
soup = BeautifulSoup(text, 'lxml')
for x in soup.findAll(text=lambda text: isinstance(text, Comment)):
    print(x)
Upvotes: 0