Reputation: 453
I want to scrape and parse a table from a nested URL link, build a pandas
dataframe from it, and export it. I have figured out how to scrape a table from an HTML page when the page has only one table, but now I need to scrape and parse a table from each sublink listed on the mother link. I think I need to loop through all the sublinks and parse the table I am interested in from each one. I am wondering whether there is an efficient way to do this with BeautifulSoup.
Can anyone point out how to make this happen?
My attempt
Here is my current attempt, which scrapes and parses a single table from an HTML page. I don't know how to scrape and parse a table, identified by its unique table name, from a nested HTML page and create a pandas dataframe at the end.
import requests
import pandas as pd
from bs4 import BeautifulSoup

def scrape_table(url):
    response = requests.get(url, timeout=10)
    bs = BeautifulSoup(response.content, 'html.parser')
    table = bs.find('table')  # only finds the first table on the page
    list_of_rows = []
    for row in table.findAll('tr'):
        list_of_cells = []
        for cell in row.findAll('td'):
            list_of_cells.append(cell.text)
        list_of_rows.append(list_of_cells)
    x = list_of_rows[1:]  # skip the header row
    df = pd.DataFrame(x, index=None)
    df.to_csv("output.csv")
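For a single page this works, for example with one of the child URLs below. But bs.find('table') only returns the first table on the page, which is exactly what breaks down on the monthly summary pages, since they contain several tables:

scrape_table("http://www.bom.gov.au/climate/current/month/aus/archive/202001.summary.shtml")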
but here is what I want to do:
main_entry_html = "http://www.bom.gov.au/climate/current/statement_archives.shtml"
child_url_1= "http://www.bom.gov.au/climate/current/month/aus/archive/202001.summary.shtml"
child_url_2 = "http://www.bom.gov.au/climate/current/month/aus/archive/202002.summary.shtml"
child_url_3 = "http://www.bom.gov.au/climate/current/month/aus/archive/202003.summary.shtml"
...
and so on. I need to access every monthly summary link from 2015-01 through 2020-07, scrape and parse the table titled Area-average rainfall from each one, and finally create a dataframe like the desired output shown below.
I think I can use a for loop to iterate over each child URL link (i.e., each monthly summary link) and then pick out the table I want by looking at its table name; a rough sketch of what I mean follows. I am not sure how to achieve this in Python, though. Can anyone point out how to make this happen? Any thoughts?
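This is the shape of the loop I have in mind; whether the table carries a caption element with the text Area-average rainfall is an assumption on my part, since I have not inspected the page markup:

# rough sketch -- the caption lookup is an assumption about the page markup
for url in [child_url_1, child_url_2, child_url_3]:
    response = requests.get(url, timeout=10)
    bs = BeautifulSoup(response.content, 'html.parser')
    for table in bs.find_all('table'):
        caption = table.find('caption')
        if caption and 'Area-average rainfall' in caption.get_text():
            # parse this table's rows into a dataframe, as in scrape_table above
            ...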
Desired output:
Here is the desired dataframe I want to get after scraping and parsing all the tables from all the child URL links. An example with dummy values:
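(dummy values; one row per region per month, with the Area-average rainfall table's columns plus a dates column)

dates    region      rank  average  departure from mean  comment
2020-01  Australia   99    99.9     +9%                  dummy
2020-01  Queensland  99    99.9     +9%                  dummy
2020-02  Australia   99    99.9     +9%                  dummy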
Is there any way to get the desired dataframe above? How can I scrape and parse the table from each nested URL link? Can anyone give me ideas on how to achieve my expected output? Thanks.
Upvotes: 2
Views: 565
Reputation: 1769
You can use pandas' built-in function pandas.read_html()
to find the table directly and convert it into a DataFrame.
After this you can collect all the DataFrames in a list and concatenate
them. I would also recommend using string formatting
to build the URLs, because then you can loop through all your dates, which also gives you the opportunity to store the date as a column in your DataFrame.
import pandas as pd
from bs4 import BeautifulSoup
import requests
import time
# build the list of YYYYMM date strings (2019-01 through 2020-07)
years = ['2019', '2020']  # extend to ['2015', ..., '2020'] for the full range in the question
dates = []
for year in years:
    for month in range(1, 13):
        if year == '2020' and month > 7:  # stop after July 2020
            break
        dates.append(year + str(month).zfill(2))

frames = []
for date in dates:
    r = requests.get(f'http://www.bom.gov.au/climate/current/month/aus/archive/{date}.summary.shtml', timeout=10)
    soup = BeautifulSoup(r.content, 'html.parser')
    # the third table on the page (index 2) is the Area-average rainfall table
    result = pd.read_html(str(soup))[2]
    # prepend the date as a column, reformatted from YYYYMM to YYYY-MM
    result.insert(0, 'dates', date[:4] + '-' + date[4:])
    result.columns = ['dates', 'region', 'rank', 'average', 'departure from mean', 'comment']
    frames.append(result)
    time.sleep(1)  # be polite to the server
full_df = pd.concat(frames, ignore_index=True)
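If you also want the CSV export from the question, write the combined frame out at the end; the file name here is just an example:

full_df.to_csv('area_average_rainfall.csv', index=False)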
Upvotes: 2