Reputation: 842
I would like to scrape a table that appears at this URL: https://www.coronavirus.vic.gov.au/exposure-sites
When I right-click and inspect the element, I can see a table element with a class that can be referenced. However, when I request the page with requests, that element is not in the response.
Reproducible example:
import pandas as pd
import requests
from bs4 import BeautifulSoup
header = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}
link = 'https://www.coronavirus.vic.gov.au/exposure-sites'
r = requests.get(link, headers=header)
soup = BeautifulSoup(r.text, "html5lib")
htmltable = soup.find('table', { 'class' : "rpl-row rpl-search-results-layout__main rpl-row--gutter" })
# Error Appears here because the above doesn't exist even though it should?
print(htmltable)
def tableDataText(table):
    """Parses a html segment started with tag <table> followed
    by multiple <tr> (table rows) and inner <td> (table data) tags.
    It returns a list of rows with inner columns.
    Accepts only one <th> (table header/data) in the first row.
    """
    def rowgetDataText(tr, coltag='td'):  # td (data) or th (header)
        return [td.get_text(strip=True) for td in tr.find_all(coltag)]
    rows = []
    trs = table.find_all('tr')
    headerow = rowgetDataText(trs[0], 'th')
    if headerow:  # if there is a header row include first
        rows.append(headerow)
        trs = trs[1:]
    for tr in trs:  # for every table row
        rows.append(rowgetDataText(tr, 'td'))  # data row
    return rows
list_table = tableDataText(htmltable)
df = pd.DataFrame(list_table[1:], columns=list_table[0])
df
The end state should be a table which is a collection of the 18 pages of tables on the webpage.
Upvotes: 0
Views: 48
Reputation: 3400
You can call the URL below to get the data as JSON. The response holds a list of dictionaries under result -> records; loop over that list and extract each value by its key.
import requests

# JSON endpoint found in the Network tab; limit=10000 asks for up to 10000 records in one response
url = "https://www.coronavirus.vic.gov.au/sdp-ckan?resource_id=afb52611-6061-4a2b-9110-74c920bede77&limit=10000"
res = requests.get(url)
data = res.json()
main_data = data['result']['records']  # list of dicts, one per table row
for record in main_data:
    print(record['Suburb'])
    print(record['Site_title'])
Output:
Newport
TyrePlus Newport
Newport
TyrePlus Newport
Newport
...
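The loop above assumes that every record carries both keys. Here is a minimal sketch of a more defensive version, assuming the response keeps the result -> records shape used above. The helper name extract_fields and the sample payload are made up for illustration and are not part of the site's API:

```python
def extract_fields(payload, keys):
    """Return one tuple per record with the values for `keys` (None where a key is missing)."""
    records = payload.get("result", {}).get("records", [])
    return [tuple(record.get(key) for key in keys) for record in records]


# Illustrative payload in the same shape as the response above (values are made up).
sample_payload = {
    "result": {
        "records": [
            {"Suburb": "Newport", "Site_title": "TyrePlus Newport"},
            {"Suburb": "Example Suburb"},  # hypothetical record with no Site_title key
        ]
    }
}

rows = extract_fields(sample_payload, ["Suburb", "Site_title"])
for suburb, site in rows:
    print(suburb, site)
```

With the real response you would pass res.json() instead of sample_payload.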
How to find the URL: open Chrome Developer Tools, go to the Network tab, and refresh the page. Find the request that returns the data in the list on the left-hand side; its Preview tab shows the JSON response, and from that request you can read off the URL.
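Before hunting through the Network tab, it can help to check whether the HTML that requests receives contains any table at all. The sketch below uses only the standard library; TagCounter and count_tag are hypothetical helper names, and the two sample pages are made-up stand-ins rather than the real site's markup:

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how many times a given tag opens in an HTML document."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase
        if tag == self.tag:
            self.count += 1


def count_tag(html, tag="table"):
    counter = TagCounter(tag)
    counter.feed(html)
    return counter.count


# Made-up stand-ins for a server-rendered page and a JavaScript-rendered one.
static_page = "<html><body><table><tr><td>x</td></tr></table></body></html>"
js_page = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'

print(count_tag(static_page))
print(count_tag(js_page))
```

If count_tag(r.text) from the question's code comes back as 0, the table is probably built in the browser by JavaScript after the page loads, which is when looking for a JSON request like the one above pays off.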
To load the records into a DataFrame:
import pandas as pd
df = pd.DataFrame(main_data)
Upvotes: 1