mellifluous

Reputation: 2975

How to parse HTML and get table ids using Python

I want to parse HTML and get a list of table IDs using Python.

I have an HTML document with multiple tables. The page I am trying to scrape is https://docs.aws.amazon.com/workspaces/latest/adminguide/workspaces-port-requirements.html, and its markup looks like this:

<html>
<div class="table-container">
  <div class="table-contents disable-scroll">
     <table id="w345aab9c13c11b5"> <!-- this is the table id for the table name below -->
        <thead>
           <tr>
              <th class="table-header" colspan="100%">
                 <div class="title">Domains and IP addresses to add to your allow list</div> <!-- I need to look for this table name and get the table id associated with it -->
              </th>
           </tr>
        </thead>
        <tbody>
          ...
        </tbody>
    </table>
  </div>
</div>
<div class="table-container">
  <div class="table-contents disable-scroll">
     <table id="w345aab9c13c13b2">
        <thead>
           <tr>
              <th class="table-header" colspan="100%">
                 <div class="title">Domains and IP Addresses to Add to Your Allow List for PCoIP</div>
              </th>
           </tr>
           <tr>
          ...
           </tr>
        </thead>
        <tbody>
          ...
        </tbody>
    </table>
  </div>
</div>
...
</html>

I need to find the div whose text matches a given table name and get the id of the table that contains it.

I am new to Python, so any suggestions on how to approach this, or a solution, would really help.

Upvotes: 0

Views: 1367

Answers (1)

user5386938


You can use BeautifulSoup to get the IDs:

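import requests
from bs4 import BeautifulSoup

url = 'https://docs.aws.amazon.com/workspaces/latest/adminguide/workspaces-port-requirements.html'

resp = requests.get(url)
resp.raise_for_status()  # stop early if the page could not be fetched

soup = BeautifulSoup(resp.content, 'html.parser')

# Consider only tables that have an id attribute.
for t in soup.select('table[id]'):
    # Substring match on the table's text; the comparison is case-sensitive.
    if 'Domains and IP Addresses to Add to Your Allow List' in t.get_text():
        print(t['id'])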

I trust you can figure out how to incorporate this into your code.
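
One thing to watch for: the `in` check above is a case-sensitive substring test, so with the two titles shown in the question it would match only the PCoIP table, not the first one. If you need the id for one specific title, a possible variation is to read each `div.title` and compare it exactly, ignoring case. This is only a sketch: `SAMPLE_HTML` is a trimmed copy of the markup from the question (not the live page), and the helper names `table_ids_by_title` and `find_table_id` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed-down sample mirroring the structure shown in the question.
SAMPLE_HTML = """
<div class="table-container">
  <table id="w345aab9c13c11b5">
    <thead><tr><th class="table-header">
      <div class="title">Domains and IP addresses to add to your allow list</div>
    </th></tr></thead>
  </table>
</div>
<div class="table-container">
  <table id="w345aab9c13c13b2">
    <thead><tr><th class="table-header">
      <div class="title">Domains and IP Addresses to Add to Your Allow List for PCoIP</div>
    </th></tr></thead>
  </table>
</div>
"""


def table_ids_by_title(html):
    """Map each table's title text to that table's id attribute."""
    soup = BeautifulSoup(html, "html.parser")
    ids = {}
    # Only look at title divs that sit inside a table carrying an id.
    for title in soup.select("table[id] div.title"):
        table = title.find_parent("table", id=True)
        ids[title.get_text(strip=True)] = table["id"]
    return ids


def find_table_id(html, wanted_title):
    """Return the id of the table whose title equals wanted_title, ignoring case."""
    for title, table_id in table_ids_by_title(html).items():
        if title.casefold() == wanted_title.casefold():
            return table_id
    return None  # no table with that title


print(find_table_id(SAMPLE_HTML, "Domains and IP addresses to add to your allow list"))
```

To run it against the real page, pass `resp.content` from the snippet above instead of `SAMPLE_HTML`. That assumes the live markup still uses `div.title` inside each table, which is worth checking in your browser first.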

Upvotes: 1
