schnydszch

Reputation: 435

Python BeautifulSoup: find a table by the text content of its innerHTML, then get everything after it up to (but not including) a particular element

I have an HTML file with a lot of tables to traverse, like the one below:

<html>
 .. omitted parts since I am interested on the HTML table..
 <table>
  <tbody>
   <tr>
    <td>
     <table>
      <tbody>
       <tr>
        <td class="labeltitle">
         <tbody>
          <tr>
           <td class="labeltitle">
            <font color="FFD700">Floor Activity<a name="#jump_fa"></a></font>
           </td>
           <td class="labelplain">&nbsp;&nbsp;&nbsp;</td>
          </tr>
         </tbody>
        </td>
       </tr>
      </tbody>
     </table>
    </td>
   </tr>
  </tbody>
 </table>
 <table>
  ... omitted just to show the td that I am interested to scrape ...
         <td class="labelplain">&nbsp;Senator(s)</td>
         <td class="labelplain">
          <table>
           <tbody>
            <tr> 
             <td class="labelplain">VILLAR JR., MANUEL B.<br></td>
            </tr>
           </tbody>
          </table>
         </td>
    ... 
 </table>
 <table>
    ... More tables like the table above (the one with VILLAR Jr.)
 </table>
 <table>
  <tbody>
   <tr> 
    <td class="labeltitle">
     <table>
      <tbody>
       <tr> 
        <td class="labeltitle">&nbsp;<font color="FFD700">Vote(s)<a name="#jump_vote"></a></font></td>
        <td class="labelplain">&nbsp;&nbsp;&nbsp;</td>
       </tr>
      </tbody>
     </table>
    </td>
   </tr>
  </tbody>
 </table>   
   
 ... more tables
 
</html>

The starting point is the table whose td with class "labeltitle" has a child "font" element with the text "Floor Activity". I want the HTML of every table below it, stopping just before the table whose td class="labeltitle" has a child "font" with the text "Vote(s)". I tried XPath like so:

    table = dom.xpath("//table[8]/tbody/tr/td")
    print (table)

but to no avail: I only get empty lists. Any approach would do, with or without XPath.

I also tried the following:

    rows = soup.find('a', attrs={'name': '#jump_fa'}).find_parent('table').find_parent('table')

With this I can reach the table containing "Floor Activity", but the code only gives me the contents of that particular parent table. The exact output is below:

<tr>
<td class="labeltitle" height="22"><table border="0" cellpadding="0" cellspacing="0" width="100%">
<tr>
<td class="labeltitle" width="50%"> <font color="FFD700">Floor 
                                Activity<a name="#jump_fa"></a></font></td>
<td align="right" class="labelplain" width="50%"> 
                                   </td>
</tr>
</table></td>
</tr>

I am also trying the approach from "Find next siblings until a certain one using beautifulsoup", because it seems to fit my use case, but I get the error "'NoneType' object has no attribute 'next_sibling'". That is to be expected, since the update2 script does not include the other tables, so the update2 code is out of the equation.
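A minimal sketch of that sibling-walking idea, with a guard against the None case that produces the error above. It assumes the section tables are direct siblings of the table found by the two find_parent('table') calls, and that the "Vote(s)" heading carries an anchor named "#jump_vote" as in the sample. The SAMPLE markup and the helper names are made up for illustration and are much simpler than the real page.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the real page (assumption: same nesting as the sample above).
SAMPLE = """
<html><body>
<table><tr><td class="labeltitle"><table><tr>
<td class="labeltitle"><font color="FFD700">Floor Activity<a name="#jump_fa"></a></font></td>
</tr></table></td></tr></table>
<table><tr><td class="labelplain">Senator(s)</td><td class="labelplain">VILLAR JR., MANUEL B.</td></tr></table>
<table><tr><td class="labelplain">Status Date</td><td class="labelplain">10/12/2005</td></tr></table>
<table><tr><td class="labeltitle"><table><tr>
<td class="labeltitle"><font color="FFD700">Vote(s)<a name="#jump_vote"></a></font></td>
</tr></table></td></tr></table>
<table><tr><td class="labelplain">after the votes heading</td></tr></table>
</body></html>
"""


def heading_table(soup, anchor_name, levels=2):
    """Return the table `levels` tables above the named anchor, or None if missing."""
    node = soup.find("a", attrs={"name": anchor_name})
    for _ in range(levels):
        if node is None:
            return None
        node = node.find_parent("table")
    return node


def tables_between(soup, start_name="#jump_fa", stop_name="#jump_vote"):
    """Collect sibling tables after the start heading, stopping before the stop heading."""
    start = heading_table(soup, start_name)
    stop = heading_table(soup, stop_name)
    if start is None:
        return []  # start heading not found, so there is nothing to walk
    collected = []
    for sibling in start.find_next_siblings("table"):
        if sibling is stop:  # identity check: this is the "Vote(s)" heading table
            break
        collected.append(sibling)
    return collected


soup = BeautifulSoup(SAMPLE, "html.parser")
found = tables_between(soup)
flooract = "".join(str(table) for table in found)
```

If the real page wraps everything in an outer layout table, the levels argument would need adjusting so that the heading table and the section tables end up as siblings.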

My expected output is a JSON file (with special characters escaped) like:

{"title": "<str(var)>", "body": "<flooract>"}

*where flooract is the html code of the tables with special characters escaped. Sample snippet:

<table>\n<tbody>\n<tr>\n<td class=\"labelplain\">&nbsp;Status Date<\/td><td class=\"labelplain\">&nbsp;10/12/2005<\/td>\n<\/tr>\n<tr><td class=\"labelplain\">&nbsp;Parliamentary Status<\/td>\n<td class=\"labelplain\"><table>\n<tbody><tr>\n<td class="labelplain">SPONSORSHIP SPEECH<br>...Until Period of Committee Amendments 
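One way the escaping could be handled is to let the standard json module build the record instead of concatenating strings by hand. This is only a sketch: build_record is a hypothetical helper, and the title "54629" and the snippet are placeholder values. Note that json.dumps escapes double quotes and newlines but leaves "/" alone; both forms are valid JSON, and the optional replace below is there only if the "\/" style shown above is required.

```python
import json


def build_record(title, flooract_html, escape_slashes=False):
    """Serialize the title and HTML body to a JSON string with proper escaping."""
    record = json.dumps({"title": str(title), "body": flooract_html})
    if escape_slashes:
        # Optional: "\/" is a legal JSON escape, so this keeps the output valid.
        record = record.replace("/", "\\/")
    return record


# Placeholder HTML standing in for the collected tables.
snippet = '<table>\n<tr>\n<td class="labelplain">&nbsp;Status Date</td>\n</tr>\n</table>'

record = build_record("54629", snippet)
record_slashes = build_record("54629", snippet, escape_slashes=True)

# Writing the result to disk (the file name is an assumption):
# with open("54629.json", "w", encoding="utf-8") as fh:
#     fh.write(record)
```

Either string decodes back to the original HTML with json.loads, so the choice between the two forms only matters if a downstream consumer expects the escaped slashes.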

Link to sample file: https://issuances-library.senate.gov.ph/54629.html (screenshots of the first and second parts of the page are attached).

In the third screenshot, I have circled in red the only part I want to get from the HTML file.

Upvotes: 0

Views: 197

Answers (0)
