How to parse HTML and get table ids using Python

Question

I want to parse HTML and get a list of table ids using Python.

The page I am trying to scrape is https://docs.aws.amazon.com/workspaces/latest/adminguide/workspaces-port-requirements.html

It contains multiple tables in the format below:



  
# this is table id for below table name
...
Domains and IP addresses to add to your allow list # I need to look for this table name and get the table id associated with it

...
Domains and IP Addresses to Add to Your Allow List for PCoIP

...

I need to check for a matching value in the div tag and get the table id associated with it.

I am new to Python. Any suggestions on how to approach this, or a solution, would really help.

user5386938 · Accepted Answer

You can use BeautifulSoup to get the IDs:

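import requests
from bs4 import BeautifulSoup

url = 'https://docs.aws.amazon.com/workspaces/latest/adminguide/workspaces-port-requirements.html'
title = 'Domains and IP addresses to add to your allow list'

resp = requests.get(url)
resp.raise_for_status()  # stop early on an HTTP error

soup = BeautifulSoup(resp.content, 'html.parser')

# Only consider tables that actually carry an id attribute.
for t in soup.select('table[id]'):
    # Compare case-insensitively: the titles quoted in the question differ in capitalisation.
    if title.lower() in t.get_text().lower():
        print(t['id'])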

I trust you can figure out how to incorporate this into your code.
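
If installing third-party packages is not an option, the same idea can probably be sketched with the standard library's html.parser instead of BeautifulSoup: remember the id of each table as it opens, collect its text, and compare case-insensitively when it closes. This is only a rough sketch under assumptions. The TableIdFinder class, the find_table_ids helper, the sample markup, and the ids table-a to table-c are all made up for illustration and are not taken from the AWS page.

```python
from html.parser import HTMLParser


class TableIdFinder(HTMLParser):
    """Collect the id of every <table> whose text contains a given phrase."""

    def __init__(self, phrase):
        super().__init__()
        self.phrase = phrase.casefold()
        self.stack = []    # one [table_id, text_parts] entry per open <table>
        self.matches = []  # ids of matching tables, in the order they close

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.stack.append([dict(attrs).get("id"), []])

    def handle_data(self, data):
        # Text counts towards every table that is currently open (covers nesting).
        for entry in self.stack:
            entry[1].append(data)

    def handle_endtag(self, tag):
        if tag == "table" and self.stack:
            table_id, parts = self.stack.pop()
            text = " ".join("".join(parts).split())  # normalise whitespace
            if table_id and self.phrase in text.casefold():
                self.matches.append(table_id)


def find_table_ids(html, phrase):
    """Return the ids of all tables whose text contains phrase (case-insensitive)."""
    finder = TableIdFinder(phrase)
    finder.feed(html)
    finder.close()
    return finder.matches


# Hypothetical markup: the real page's structure and ids will differ.
SAMPLE_HTML = """
<table id="table-a">
  <tr><th><div class="title">Domains and IP addresses to add to your allow list</div></th></tr>
</table>
<table id="table-b">
  <tr><th><div class="title">Domains and IP Addresses to Add to Your Allow List for PCoIP</div></th></tr>
</table>
<table id="table-c">
  <tr><th><div class="title">Ports for Client Applications</div></th></tr>
</table>
"""

print(find_table_ids(SAMPLE_HTML, "allow list"))  # ['table-a', 'table-b']
```

Because this is a substring match, "allow list" returns both tables. Passing a longer phrase such as "for PCoIP" narrows the result to one.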
