Get table inside an html comment with python

Question

I am trying to parse a webpage that has a table inside a comment. I can't seem to figure out how to get the columns and data of the table out of the comment. Here's part of the html source:


    
       
       blah, blah    
       
            * indicates something important

I am using PyQuery but am open to other solutions. So far I get a PyQuery document from the html as follows:

from pyquery import PyQuery as pq
import requests

doc = pq(requests.get(url).content)
table = doc('#all_info')

That gets me the PyQuery object with the text I showed above. I also found etree which I can use to isolate the comment text, but then I lose the ability to isolate html markup in the text. Here's that code:

from lxml import etree
tree = etree.fromstring(str(table))
comments = tree.xpath('//comment()')
for c in comments:
    print(c)

As a note, there's only one comment in each comment list.

Does anyone have ideas on a better way to approach this? One thought I have is to remove the comment markup and treat everything inside the comment as valid HTML, but I couldn't figure out how to do that and still use PyQuery to find objects. I am open to using BeautifulSoup or other libraries.

DYZ · Accepted Answer

If there is indeed only one comment per document, simply remove the comment markers from the HTML string before passing it to BeautifulSoup or whatever you use for parsing:

doc = doc.replace("<!--", "").replace("-->", "")
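As a rough, self-contained sketch of stripping the comment delimiters so the table becomes ordinary markup: the sample page, the `strip_comment_markers` and `extract_rows` helpers, and the `TableRows` class below are all hypothetical, and the standard library's `html.parser` stands in for PyQuery or BeautifulSoup only so the example runs without extra packages. It assumes a single comment whose content is valid HTML.

```python
from html.parser import HTMLParser

# Hypothetical sample page: the table markup sits inside an HTML comment.
SAMPLE_HTML = """
<div id="all_info">
<!--
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Alice</td><td>10</td></tr>
  <tr><td>Bob</td><td>7</td></tr>
</table>
-->
</div>
"""


def strip_comment_markers(html):
    """Drop the comment delimiters so the commented-out markup becomes live HTML."""
    return html.replace("<!--", "").replace("-->", "")


class TableRows(HTMLParser):
    """Collect the text of each <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears without a <tr>
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between tags is ignored.
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def extract_rows(html):
    """Uncomment the markup, then return the table as a list of rows."""
    parser = TableRows()
    parser.feed(strip_comment_markers(html))
    parser.close()
    return parser.rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE_HTML):
        print(row)
```

The string returned by `strip_comment_markers` could just as well be handed to `pq(...)` or `BeautifulSoup(...)` in place of the small parser here, after which the usual selectors should find the table.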
