Reputation: 2975
I want to parse HTML and get a list of table IDs using Python.
I have an HTML document in the format below, with multiple tables.
The page I am trying to scrape for table IDs is https://docs.aws.amazon.com/workspaces/latest/adminguide/workspaces-port-requirements.html
<html>
<div class="table-container">
<div class="table-contents disable-scroll">
<table id="w345aab9c13c11b5"> # this is table id for below table name
<thead>
<tr>
<th class="table-header" colspan="100%">
<div class="title">Domains and IP addresses to add to your allow list</div> # I need to look for this table name and get the table id associated with it
</th>
</tr>
</thead>
<tbody>
...
</tbody>
</table>
</div>
</div>
<div class="table-container">
<div class="table-contents disable-scroll">
<table id="w345aab9c13c13b2">
<thead>
<tr>
<th class="table-header" colspan="100%">
<div class="title">Domains and IP Addresses to Add to Your Allow List for PCoIP</div>
</th>
</tr>
<tr>
...
</tr>
</thead>
<tbody>
...
</tbody>
</table>
</div>
</div>
...
</html>
I need to check for a matching value in the div tag and get the table ID associated with it.
I am new to Python; any suggestions on how to approach this, or a solution, would really help.
Upvotes: 0
Views: 1367
You can use BeautifulSoup to get the IDs:
import requests
from bs4 import BeautifulSoup

url = 'http://docs.aws.amazon.com/workspaces/latest/adminguide/workspaces-port-requirements.html'
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'html.parser')

# Walk every <table> element that has an id attribute
for t in soup.select('table[id]'):
    # Case-sensitive substring check against all of the table's text
    if 'Domains and IP Addresses to Add to Your Allow List' in t.getText():
        print(t.attrs['id'])
I trust you can figure out how to incorporate this into your code.
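Note that the substring check above is case-sensitive and looks at the whole table's text, so with the sample markup from the question it would match only the PCoIP table. If you want to look up one specific table by its title, here is a rough sketch of one way to do it. It swaps BeautifulSoup for the standard library's html.parser so it runs without extra packages, and it runs against a trimmed copy of the question's markup rather than the live page. The names SAMPLE_HTML, TableTitleParser and find_table_id are made up for illustration, and it assumes tables are not nested and the title div contains no inner div.

```python
from html.parser import HTMLParser

# Trimmed copy of the markup from the question (illustrative input only).
SAMPLE_HTML = """
<div class="table-container">
<table id="w345aab9c13c11b5">
<thead><tr><th class="table-header">
<div class="title">Domains and IP addresses to add to your allow list</div>
</th></tr></thead>
</table>
</div>
<div class="table-container">
<table id="w345aab9c13c13b2">
<thead><tr><th class="table-header">
<div class="title">Domains and IP Addresses to Add to Your Allow List for PCoIP</div>
</th></tr></thead>
</table>
</div>
"""


class TableTitleParser(HTMLParser):
    """Collect {title text: table id} for each table with a <div class="title">."""

    def __init__(self):
        super().__init__()
        self.table_id = None    # id of the table currently being parsed
        self.in_title = False   # True while inside <div class="title">
        self.title_parts = []   # text chunks of the current title
        self.titles = {}        # title text -> table id

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            self.table_id = attrs.get("id")
        elif tag == "div" and self.table_id and "title" in (attrs.get("class") or "").split():
            self.in_title = True
            self.title_parts = []

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self.in_title:
            # Assumes no nested <div> inside the title div.
            self.in_title = False
            self.titles["".join(self.title_parts).strip()] = self.table_id
        elif tag == "table":
            self.table_id = None


def find_table_id(html, title):
    """Return the id of the table whose title equals `title` (ignoring case), else None."""
    parser = TableTitleParser()
    parser.feed(html)
    parser.close()
    wanted = title.strip().casefold()
    for text, table_id in parser.titles.items():
        if text.casefold() == wanted:
            return table_id
    return None


print(find_table_id(SAMPLE_HTML, "Domains and IP addresses to add to your allow list"))
# w345aab9c13c11b5
```

For the live page you would pass resp.text from the requests call in place of SAMPLE_HTML. Comparing the full title rather than a substring keeps the two similarly named tables apart.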
Upvotes: 1