Reputation: 468
I am trying to parse a webpage that has a table inside an HTML comment. I can't figure out how to get the table's columns and data out of the comment. Here's part of the HTML source:
<div id="all_info" class="table_wrapper setup_commented commented">
<div class="section_heading">
<span class="section_anchor" id="id_link" data-label="interesting data"/>
<h2>blah, blah</h2>
<div class="section_heading_text">
<ul> <li>* indicates something important</li></ul>
</div>
</div>
<div class="placeholder"/>
<!--
<div class="table_outer_container">
<div class="overthrow table_container" id="div_info">
<table class="sortable stats_table" id="info" data-cols-to-freeze=1> <caption>Interesting data Table</caption>
<colgroup><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col></colgroup>
<thead>
<tr class="over_header"> <td> these are discard filler headers</td>
</tr>
<tr> <td> there are multiple entries here for headers </td>
</tr>
</thead>
<tbody>
<tr ><td> Lots of data here in series of columns </td>
</tr>
</tbody>
</table>
</div>
</div>
-->
</div>
I am using PyQuery but am open to other solutions. So far I get a PyQuery document from the html as follows:
from pyquery import PyQuery as pq
import requests
doc = pq(requests.get(url).content)
table = doc('#all_info')
That gets me the PyQuery object with the text I showed above. I also found etree, which I can use to isolate the comment text, but then I lose the ability to select HTML elements within that text. Here's that code:
from lxml import etree
tree = etree.fromstring(str(table))
comments = tree.xpath('//comment()')
for c in comments:
    print(c)
As a note, there's only one comment in each comment list.
Does anyone have ideas on a better way to approach this? One thought is to remove the comment markers and treat everything inside the comment as ordinary HTML, but I couldn't figure out how to do that and still use PyQuery to find elements. I am open to using BeautifulSoup or other libraries.
Upvotes: 0
Views: 549
Reputation: 57033
If there is indeed only one comment per document, simply strip the comment delimiters from the raw HTML string before passing it to BeautifulSoup or whatever you use for parsing:
doc = doc.replace("<!--","").replace("-->","")
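To show the effect end to end, here is a minimal sketch of that replacement. It assumes the comment delimiters only wrap the hidden table, and it uses the standard library's html.parser as a stand-in for PyQuery or BeautifulSoup so it runs without extra packages. The names uncomment and CellCollector, and the sample markup with its cell values, are made up for illustration.

```python
from html.parser import HTMLParser


def uncomment(html):
    """Drop comment delimiters so commented-out markup becomes live markup."""
    return html.replace("<!--", "").replace("-->", "")


class CellCollector(HTMLParser):
    """Collect the text of each <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("td", "th") and self.rows:
            self._in_cell = True
            self.rows[-1].append("")      # start a new cell in the current row

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data     # text may arrive in several chunks


# Hypothetical sample shaped like the page in the question.
sample = """<div id="all_info">
<!--
<table id="info">
<thead><tr><td>Name</td><td>Value</td></tr></thead>
<tbody><tr><td>a</td><td>1</td></tr></tbody>
</table>
-->
</div>"""

parser = CellCollector()
parser.feed(uncomment(sample))
print(parser.rows)  # [['Name', 'Value'], ['a', '1']]
```

With PyQuery the same cleaned string can presumably be handed straight to pq(...), after which a selector such as doc('#info tr') should find the rows. Note that on Python 3 the replacement needs a str, so requests.get(url).text is the safer input; calling replace with str arguments on the bytes from .content raises a TypeError.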
Upvotes: 1