BeautifulSoup: Get table that doesn't appear within the HTML?

Question

I would like to obtain a table that appears on a URL: https://www.coronavirus.vic.gov.au/exposure-sites

When I right-click and inspect the element in the browser, there is clearly a table element with a class that can be referenced. However, that element does not appear in the HTML returned by requests.

Reproducible example:

import pandas as pd
import requests
from bs4 import BeautifulSoup

header = {
  "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
  "X-Requested-With": "XMLHttpRequest"
}



link = 'https://www.coronavirus.vic.gov.au/exposure-sites'

r = requests.get(link, headers=header)
soup = BeautifulSoup(r.text, "html5lib")
htmltable = soup.find('table', { 'class' : "rpl-row rpl-search-results-layout__main rpl-row--gutter" })
# This prints None: the element is not found, even though it is visible in the browser
print(htmltable)


def tableDataText(table):    
    """Parses a html segment started with tag <table> followed
    by multiple <tr> (table rows) and inner <td> (table data) tags.
    It returns a list of rows with inner columns.
    Accepts only one <th> (table header/data) in the first row.
    """
    def rowgetDataText(tr, coltag='td'): # td (data) or th (header)       
        return [td.get_text(strip=True) for td in tr.find_all(coltag)]  
    rows = []
    trs = table.find_all('tr')
    headerow = rowgetDataText(trs[0], 'th')
    if headerow: # if there is a header row include first
        rows.append(headerow)
        trs = trs[1:]
    for tr in trs: # for every table row
        rows.append(rowgetDataText(tr, 'td') ) # data row       
    return rows


list_table = tableDataText(htmltable)
df = pd.DataFrame(list_table[1:], columns=list_table[0])
df


The end result should be a single table that combines all 18 pages of the table on the webpage.

Bhavya Parikh · Accepted Answer

The table is not in the HTML you requested because the page loads its data through a separate request. You can call that URL directly to get the data as JSON. It returns a list of dictionaries; loop over the list and extract each value by its key.

import requests

# JSON endpoint the page itself calls; limit=10000 requests up to 10000 records at once
res = requests.get("https://www.coronavirus.vic.gov.au/sdp-ckan?resource_id=afb52611-6061-4a2b-9110-74c920bede77&limit=10000")
data = res.json()

# The records are a list of dictionaries
main_data = data['result']['records']
for record in main_data:
    print(record['Suburb'])
    print(record['Site_title'])
    print(main_data[i]['Site_title'])

Output:

Newport
TyrePlus Newport
Newport
TyrePlus Newport
Newport
...

How to find the URL: open Chrome Developer Tools, go to the Network tab, and refresh the page. In the list of requests on the left-hand side, select the one that carries the data; its Preview tab shows the JSON, and the request URL is the one to call.

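Before digging through the Network tab, it can help to confirm that the table really is missing from the raw response. The following is only a rough sketch using the standard library: `TagCounter`, `count_tags`, and the two HTML strings are hypothetical stand-ins, not the actual page source.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how often a given tag opens in an HTML string."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method
        if tag == self.tag:
            self.count += 1


def count_tags(html, tag="table"):
    """Return the number of opening `tag` elements found in `html`."""
    parser = TagCounter(tag)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical stand-in for what the server sends: an empty container plus a script
raw_html = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'
# Hypothetical stand-in for what the inspector shows after the script has run
rendered_html = '<html><body><div id="app"><table><tr><td>Newport</td></tr></table></div></body></html>'

print(count_tags(raw_html))       # 0
print(count_tags(rendered_html))  # 1
```

If `count_tags(r.text)` gives 0 for the real response, the table is being added by a script after the page loads, and the data has to come from a separate request like the one above.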

To get a DataFrame:

import pandas as pd
df = pd.DataFrame(main_data)
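If only some columns are needed, the records can be trimmed before building the DataFrame. This is a minimal sketch: `extract_records` is a hypothetical helper, the `success` flag is an assumption based on the usual CKAN response shape, and the `Notes` key in the sample payload is made up purely for illustration.

```python
def extract_records(payload, columns=None):
    """Return the records from a parsed JSON payload, optionally keeping only `columns`."""
    # Assumption: CKAN-style responses carry a "success" flag; treat a missing flag as OK
    if not payload.get("success", True):
        raise ValueError("the API reported a failure")
    records = payload["result"]["records"]
    if columns is None:
        return records
    # Keys missing from a record become None instead of raising KeyError
    return [{col: rec.get(col) for col in columns} for rec in records]


# Hypothetical payload mirroring the data['result']['records'] structure used above
sample_payload = {
    "success": True,
    "result": {
        "records": [
            {"Suburb": "Newport", "Site_title": "TyrePlus Newport", "Notes": "made-up field"},
            {"Suburb": "Newport", "Site_title": "TyrePlus Newport"},
        ]
    },
}

rows = extract_records(sample_payload, columns=["Suburb", "Site_title"])
print(rows[0])  # {'Suburb': 'Newport', 'Site_title': 'TyrePlus Newport'}
```

With the real response this would be used as `pd.DataFrame(extract_records(data, ["Suburb", "Site_title"]))`.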
