Richard

Reputation: 65550

Extract HTML comments in Python, using regex or lxml?

How do I extract all HTML-style comments from a document, using Python?

I've tried using a regex:

import re

text = 'hello, world <!-- comment -->'
re.match('<!--(.*?)-->', text)

But it produces nothing. I don't understand this since the same regex works fine on the same string at https://regex101.com/

UPDATE: My document is actually an XML file, and I'm parsing the document with pyquery (based on lxml), but I don't think lxml can extract comments that aren't inside a node. This is what the document looks like:

<?xml version="1.0" encoding="UTF-8"?>
<clinical_study rank="220398">
  <intervention_browse>
    <!-- CAUTION:  The following MeSH terms are assigned with an imperfect algorithm  -->
    <mesh_term>Freund's Adjuvant</mesh_term>
    <mesh_term>Keyhole-limpet hemocyanin</mesh_term>
  </intervention_browse>
  <!-- Results have not yet been posted for this study                                -->
</clinical_study>

UPDATE 2: Thanks for suggesting the other answer, but I'm already parsing the document extensively with lxml and don't want to rewrite everything with BeautifulSoup. Have updated title accordingly.

Upvotes: 1

Views: 2896

Answers (5)

Pero

Reputation: 1451

XPath works just fine here: tree.xpath('//comment()'). For example, to remove all scripts, styles, and comments from the DOM, you could do:

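import lxml.html

tree = lxml.html.fromstring(html)  # html: the document source as a string
for el in tree.xpath('//script | //style | //comment()'):
    # detach each matched node from its parent
    el.getparent().remove(el)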

No BeautifulSoup.
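
Applied to the XML from the question, the same XPath can be sketched as follows. This is a minimal illustration, assuming the document is available as bytes and that you only want the comment text; the variable names are made up for the example.

```python
from lxml import etree

# The sample document from the question, passed as bytes because it
# carries an encoding declaration.
xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<clinical_study rank="220398">
  <intervention_browse>
    <!-- CAUTION:  The following MeSH terms are assigned with an imperfect algorithm  -->
    <mesh_term>Freund's Adjuvant</mesh_term>
    <mesh_term>Keyhole-limpet hemocyanin</mesh_term>
  </intervention_browse>
  <!-- Results have not yet been posted for this study                                -->
</clinical_study>"""

root = etree.fromstring(xml)

# //comment() selects comment nodes at any depth, including the one that
# sits directly under the root element.
comments = [c.text.strip() for c in root.xpath('//comment()')]

for comment in comments:
    print(comment)
```

Both comments come back, so there is no need to fall back to a regex for the one outside intervention_browse.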

Upvotes: 1

David Zemens

Reputation: 53623

This seems to print the comment for me:

from lxml import etree

# bytes literal: lxml on Python 3 rejects str input with an encoding declaration
txt = b"""<?xml version="1.0" encoding="UTF-8"?>
<clinical_study rank="220398">
  <intervention_browse>
    <!-- CAUTION:  The following MeSH terms are assigned with an imperfect algorithm  -->
    <mesh_term>Freund's Adjuvant</mesh_term>
    <mesh_term>Keyhole-limpet hemocyanin</mesh_term>
  </intervention_browse>
  <!-- Results have not yet been posted for this study                                -->
</clinical_study>"""
root = etree.XML(txt)
# first child of the first child of the root: the CAUTION comment
print(root[0][0])


To get the last comment:

# comment nodes that are direct children of the root element
comments = [itm for itm in root if itm.tag is etree.Comment]
if comments:
    print(comments[-1])
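
Note that iterating over root only visits its direct children. If you want comments at every depth, root.iter(etree.Comment) should do it. A small sketch, using a shortened, made-up version of the document (the comment texts here are placeholders, not the ones from the question):

```python
from lxml import etree

# Shortened stand-in for the document in the question; the comment
# texts are placeholders chosen for this example.
doc = b"""<clinical_study>
  <intervention_browse>
    <!-- nested comment -->
    <mesh_term>Freund's Adjuvant</mesh_term>
  </intervention_browse>
  <!-- top-level comment -->
</clinical_study>"""

root = etree.XML(doc)

# Direct children only: finds just the comment directly under the root.
direct = [c.text.strip() for c in root if c.tag is etree.Comment]

# iter() walks the whole tree in document order, so nested comments
# are included as well.
everywhere = [c.text.strip() for c in root.iter(etree.Comment)]

print(direct)
print(everywhere)
```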

Upvotes: 2

Rafael Albert

Reputation: 445

You have to use the re.findall() method to extract all substrings that match a certain pattern.

re.match() will only check whether the pattern fits at the beginning of the string, while re.search() will only get you the first match within the string. For your purpose, re.findall() is definitely the right method and should be preferred.
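
A minimal sketch of the difference, using the pattern from the question. The multi-line sample text and the re.DOTALL flag are my additions, on the assumption that comments in a real document may span several lines:

```python
import re

# Sample text made up for this example; the first line is from the question.
text = """hello, world <!-- comment -->
<!-- a comment
spanning two lines --> and <!-- third -->"""

# re.DOTALL lets "." match newlines, so multi-line comments are captured too.
COMMENT_RE = re.compile(r'<!--(.*?)-->', re.DOTALL)

# match() only tries at position 0, where the text starts with "hello".
at_start = COMMENT_RE.match(text)

# findall() returns the captured group of every non-overlapping match.
comments = [c.strip() for c in COMMENT_RE.findall(text)]

print(at_start)   # None
print(comments)   # ['comment', 'a comment\nspanning two lines', 'third']
```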

Upvotes: 1

pawelty

Reputation: 1000

Change match to search, and then:

import re

text = 'hello, world <!-- comment -->'
comment = re.search('<!--(.*?)-->', text)
comment.group(1)

Output:

' comment '

Upvotes: 1

Andrew Feather

Reputation: 183

You could use Beautiful Soup to extract the comments in a for loop like this:

from bs4 import BeautifulSoup, Comment

text = 'hello, world <!-- comment -->'

soup = BeautifulSoup(text, 'lxml')

# Comment is a string subclass, so filter text nodes by type
for x in soup.find_all(string=lambda text: isinstance(text, Comment)):
    print(x)

Upvotes: 0
