I have an HTML file with a lot of nested tables to traverse, like the one below:
<html>
... parts omitted, since I am only interested in the HTML tables ...
<table>
<tbody>
<tr>
<td>
<table>
<tbody>
<tr>
<td class="labeltitle">
<tbody>
<tr>
<td class="labeltitle">
<font color="FFD700">Floor Activity<a name="#jump_fa"></a></font>
</td>
<td class="labelplain"> </td>
</tr>
</tbody>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<table>
... omitted, to show only the td that I am interested in scraping ...
<td class="labelplain"> Senator(s)</td>
<td class="labelplain">
<table>
<tbody>
<tr>
<td class="labelplain">VILLAR JR., MANUEL B.<br></td>
</tr>
</tbody>
</table>
</td>
...
<table>
<table>
... More tables like the table above (the one with VILLAR Jr.)
</table>
<table>
<tbody>
<tr>
<td class="labeltitle">
<table>
<tbody>
<tr>
<td class="labeltitle"> <font color="FFD700">Vote(s)<a name="#jump_vote"></a></font></td>
<td class="labelplain"> </td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
... more tables
</html>
The starting point is the table containing a td with class "labeltitle" whose child "font" element has the text "Floor Activity". I want the HTML of every table after it, up to but not including the table that has a td with class "labeltitle" whose child "font" element has the text "Vote(s)". I am trying with XPath like so:
table = dom.xpath("//table[8]/tbody/tr/td")
print (table)
but to no avail: I get an empty list. Any approach would do (with or without XPath).
I also tried the following:
rows = soup.find('a', attrs={'name' :'#jump_fa'}).find_parent('table').find_parent('table')
With this I can reach the table containing "Floor Activity". However, the code above only gives me the content of that particular parent table. The exact output I get is below:
<tr>
<td class="labeltitle" height="22"><table border="0" cellpadding="0" cellspacing="0" width="100%">
<tr>
<td class="labeltitle" width="50%"> <font color="FFD700">Floor
Activity<a name="#jump_fa"></a></font></td>
<td align="right" class="labelplain" width="50%">
</td>
</tr>
</table></td>
</tr>
I am also trying the approach from "Find next siblings until a certain one using beautifulsoup", because it seems to fit my use case. The problem is that I get the error "'NoneType' object has no attribute 'next_sibling'". That is expected, because the update2 script does not include the other tables, so the update2 code is out of the equation.
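To make the goal concrete, here is a minimal sketch of the traversal I have in mind, not a verified solution for the real page. It assumes the two marker anchors (`#jump_fa` and `#jump_vote`) both exist, that the start marker sits two tables deep as in my attempt above, and that the tables I want appear between the two markers in document order. The function name `tables_between` is hypothetical.

```python
from bs4 import BeautifulSoup


def tables_between(html, start_name="#jump_fa", end_name="#jump_vote"):
    """Return the HTML of each table between the two marker anchors."""
    soup = BeautifulSoup(html, "html.parser")

    start_anchor = soup.find("a", attrs={"name": start_name})
    if start_anchor is None:
        return []

    # Same two-level climb as in my attempt; fall back to the inner table
    # if the marker turns out to be nested only one level deep.
    inner = start_anchor.find_parent("table")
    if inner is None:
        return []
    outer = inner.find_parent("table")
    start_table = outer if outer is not None else inner

    collected = []
    # find_all_next walks the document in order, starting inside start_table.
    for table in start_table.find_all_next("table"):
        ancestors = list(table.parents)
        # Skip tables nested inside the "Floor Activity" marker table itself.
        if any(p is start_table for p in ancestors):
            continue
        # Stop at the first table that holds the "Vote(s)" marker.
        # (If that marker is missing, everything up to the end is collected.)
        if table.find("a", attrs={"name": end_name}) is not None:
            break
        # Skip tables nested inside one already collected; their HTML is
        # already part of the parent's markup.
        if any(p is c for p in ancestors for c in collected):
            continue
        collected.append(table)

    return [str(t) for t in collected]
```

With this, something like `flooract = "".join(tables_between(page_html))` would give the combined HTML, where `page_html` is a placeholder for the downloaded page source.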
My expected output is a JSON file (with special characters escaped) like:
{"title":' + '"' + str(var) + '"' + ',"body":" + flooract + ' + "`}
*where flooract is the HTML code of the tables, with special characters escaped. Sample snippet:
<table>\n<tbody>\n<tr>\n<td class=\"labelplain\"> Status Date<\/td><td class=\"labelplain\"> 10/12/2005<\/td>\n<\/tr>\n<tr><td class=\"labelplain\"> Parliamentary Status<\/td>\n<td class=\"labelplain\"><table>\n<tbody><tr>\n<td class="labelplain">SPONSORSHIP SPEECH<br>...Until Period of Committee Amendments
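As a rough sketch of how that JSON could be built without hand-concatenating strings, assuming the title and `flooract` are plain Python strings: the standard-library `json.dumps` escapes double quotes, backslashes, and newlines on its own. The helper name `build_record` and the sample values are hypothetical. Note that `json.dumps` does not write `/` as `\/` the way my sample snippet does; both forms decode to the same string.

```python
import json


def build_record(title, body_html):
    """Serialize the title and the scraped HTML into one JSON object."""
    # json.dumps handles the escaping of quotes, backslashes and newlines.
    return json.dumps({"title": title, "body": body_html})


# Hypothetical sample values standing in for str(var) and flooract.
sample_body = '<table>\n<tr><td class="labelplain"> Status Date</td></tr>\n</table>'
record = build_record("Floor Activity", sample_body)
print(record)
```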
Link to sample file: https://issuances-library.senate.gov.ph/54629.html
I have attached a screenshot of the site (Screenshot 3), with the part I want to extract from the HTML file circled in red.