Noah

Reputation: 152

Use Beautiful Soup to Extract Multiple Tables And Headers

I have a piece of HTML structured similarly to:

<div id="MainSection">

<h3>"Title of the table below"</h3>
<table>
<tr></tr>
</table>

<h3>"Title of the table below"</h3>
<table>
<tr></tr>
</table>

<h3>"Title of the table below"</h3>
<table>
<tr></tr>
</table>

ETC...

I can strip out the 'tr' elements fairly easily, which gives me one big table, but I need a way to retain the structure of each individual table and get the title for each one.

There is an unknown number of tables, and there will be one header for each table.

I am fairly new to Python and very new to web scraping.

Upvotes: 0

Views: 340

Answers (1)

QHarr

Reputation: 84465

I don't know what the expected output should be, but with the HTML above you could select the h3 and table elements into a single node list, then loop over it, test each tag.name, and handle each element accordingly:

html = '''
<html>
 <head></head> 
 <body> 
  <div id="MainSection"> 
   <h3>"Title of the table below"</h3> 
   <table> 
    <tbody> 
     <tr><td>table1</td></tr> 
     <tr><td>x</td></tr> 
    </tbody> 
   </table> 
   <h3>"Title of the table below2"</h3> 
   <table> 
    <tbody> 
     <tr><td>table2</td></tr> 
     <tr><td>y</td></tr> 
    </tbody> 
   </table> 
  </div>  
 </body>
</html>'''

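from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup as bs

soup = bs(html, 'lxml')

# select() returns the matched h3 and table elements in document order
for item in soup.select('#MainSection h3, #MainSection table'):
    if item.name == 'h3':
        header = item.text
        print(header)
    else:
        # read_html returns a list of DataFrames; wrap the markup in StringIO
        # because newer pandas versions deprecate passing literal HTML strings
        table = pd.read_html(StringIO(str(item)))[0]
        print(table)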
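
If you want to keep each title together with its own table instead of just printing them, the same loop can collect (title, rows) pairs. This is only a minimal sketch, not the only way to do it: the function name tables_by_title, the section_id parameter, and the sample HTML are made up for illustration, and it uses the built-in 'html.parser' and plain lists rather than lxml and pandas. It assumes every table is preceded by its h3 and that tables are not nested.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup in the same shape as the question's HTML
html = """
<div id="MainSection">
  <h3>First title</h3>
  <table>
    <tr><td>a</td><td>1</td></tr>
    <tr><td>b</td><td>2</td></tr>
  </table>
  <h3>Second title</h3>
  <table>
    <tr><td>c</td><td>3</td></tr>
  </table>
</div>
"""


def tables_by_title(markup, section_id="MainSection"):
    """Return a list of (title, rows) pairs, one per table in the section."""
    soup = BeautifulSoup(markup, "html.parser")
    result = []
    title = None  # most recent h3 seen; stays None if a table has no heading
    for node in soup.select(f"#{section_id} h3, #{section_id} table"):
        if node.name == "h3":
            title = node.get_text(strip=True)
        else:
            # One inner list per row, one string per td/th cell
            rows = [
                [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
                for tr in node.find_all("tr")
            ]
            result.append((title, rows))
    return result


pairs = tables_by_title(html)
for title, rows in pairs:
    print(title, rows)
```

A list of pairs is used rather than a dict so that tables sharing the same title (as in the question's example) do not overwrite each other.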

Upvotes: 1
