Reputation: 152
I have a piece of HTML structured similarly to:
<div id="MainSection">
<h3>"Title of the table below"</h3>
<table>
<tr></tr>
</table>
<h3>"Title of the table below"</h3>
<table>
<tr></tr>
</table>
<h3>"Title of the table below"</h3>
<table>
<tr></tr>
</table>
ETC...
I can pull out the 'tr' elements fairly easily, which gives me one big table, but I need a way to keep each table separate and to get the title that goes with it.
There is an unknown number of tables, and each table has exactly one header.
I am fairly new to Python and very new to web scraping.
Upvotes: 0
Views: 340
Reputation: 84465
I don't know what your expected output should be, but with the HTML above you could gather the h3 and table elements into a single node list, then loop over it, testing tag.name and handling each case accordingly:
html = '''
<html>
<head></head>
<body>
<div id="MainSection">
<h3>"Title of the table below"</h3>
<table>
<tbody>
<tr><td>table1</td></tr>
<tr><td>x</td></tr>
</tbody>
</table>
<h3>"Title of the table below2"</h3>
<table>
<tbody>
<tr><td>table2</td></tr>
<tr><td>y</td></tr>
</tbody>
</table>
</div>
</body>
</html>'''
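from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup as bs

soup = bs(html, 'lxml')
# Select headers and tables together so each h3 is seen before its table
for item in soup.select('#MainSection h3, #MainSection table'):
    if item.name == 'h3':
        header = item.text
        print(header)
    else:
        # Parse this table alone; StringIO avoids the literal-string deprecation in newer pandas
        table = pd.read_html(StringIO(str(item)))[0]
        print(table)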
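If you want to keep each table attached to its title rather than just printing them, the same idea (walk the elements in document order, remember the latest h3, file each table under it) can be sketched with only the standard library. This is a rough illustration, not a replacement for the BeautifulSoup/pandas version: the class name TitledTableParser, the sample string, and the shape of the result (a dict of title to list of rows) are all my own assumptions, and it swaps in html.parser from the standard library instead of bs4.

```python
from html.parser import HTMLParser


class TitledTableParser(HTMLParser):
    """Hypothetical helper: group table rows under the most recent <h3>."""

    def __init__(self):
        super().__init__()
        self.tables = {}       # title -> list of rows (each row is a list of cell strings)
        self._title = None     # text of the most recent <h3>
        self._capture = None   # "h3" or "cell" while text is being collected
        self._buffer = []      # text fragments of the element being captured
        self._row = None       # cells of the row currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._capture, self._buffer = "h3", []
        elif tag == "table":
            # Register the table even if it turns out to have no rows
            self.tables.setdefault(self._title, [])
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._capture, self._buffer = "cell", []

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        text = "".join(self._buffer).strip()
        if tag == "h3" and self._capture == "h3":
            self._title, self._capture = text, None
        elif tag in ("td", "th") and self._capture == "cell":
            if self._row is not None:
                self._row.append(text)
            self._capture = None
        elif tag == "tr" and self._row is not None:
            self.tables.setdefault(self._title, []).append(self._row)
            self._row = None


# Made-up sample in the same shape as the question's HTML
sample = """
<div id="MainSection">
<h3>First title</h3>
<table><tr><td>table1</td></tr><tr><td>x</td></tr></table>
<h3>Second title</h3>
<table><tr><td>table2</td></tr><tr><td>y</td></tr></table>
</div>
"""

parser = TitledTableParser()
parser.feed(sample)
parser.close()
tables = parser.tables

for title, rows in tables.items():
    print(title, rows)
```

Note the limits of this sketch: two tables with the same title would have their rows merged under one key, and nested tables are not handled. If the titles can repeat, a list of (title, rows) pairs would be a safer container than a dict.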
Upvotes: 1