beyond_inifinity

Reputation: 453

scrape and parse tables from nested URLs in Python

I want to scrape and parse a table from each nested URL link, build a pandas DataFrame, and export it. I figured out how to scrape a table from an HTML page when the page has only one table, but now I need to scrape and parse a table from each sublink listed on the mother link, so I think I need to loop through all the sublinks and parse the table I am interested in from each one. I am wondering whether there is an efficient way to do this with BeautifulSoup. Can anyone point out how to make this happen?

my attempt

Here is my current attempt, which scrapes and parses a single table from an HTML page. I don't know how to scrape and parse a table by its unique table name from a nested HTML page and create a pandas DataFrame at the end.

import pandas as pd
import requests
from bs4 import BeautifulSoup

def scrape_table(url):
    response = requests.get(url, timeout=10)
    bs = BeautifulSoup(response.content, 'html.parser')
    table = bs.find('table')  # first table on the page
    list_of_rows = []
    for row in table.findAll('tr'):
        list_of_cells = []
        for cell in row.findAll('td'):
            list_of_cells.append(cell.text)
        list_of_rows.append(list_of_cells)
    df = pd.DataFrame(list_of_rows[1:], index=None)  # skip the header row
    df.to_csv("output.csv")

but here is what I want to do:

main_entry_html = "http://www.bom.gov.au/climate/current/statement_archives.shtml"
child_url_1 = "http://www.bom.gov.au/climate/current/month/aus/archive/202001.summary.shtml"
child_url_2 = "http://www.bom.gov.au/climate/current/month/aus/archive/202002.summary.shtml"
child_url_3 = "http://www.bom.gov.au/climate/current/month/aus/archive/202003.summary.shtml"
...

and so on. I need to access every monthly summary link from 2015-01 through 2020-07, scrape and parse the table titled Area-average rainfall, and finally create a DataFrame matching the desired output shown below.

I think I can use a for loop to iterate over each child URL link (i.e., each monthly summary link) and then parse the table I want by looking at its table name, but I am not sure how to achieve this in Python. Any possible thoughts?

desired output:

Here is the desired DataFrame that I want to get after scraping and parsing all the tables from all the child URL links, shown with dummy values:

[image: desired dataframe]

Is there any way I can get the desired DataFrame above by scraping and parsing the table from each nested URL link? Any ideas would be appreciated. Thanks

Upvotes: 2

Views: 565

Answers (1)

Jonas

Reputation: 1769

You can use the built-in pandas function pandas.read_html() to find the table directly and convert it into a DataFrame. After that you can collect all the DataFrames in a list and concatenate them. I would also recommend building the URLs with string formatting, because then you can loop through all your dates, which also gives you the opportunity to store the date as a column in your DataFrame.

import pandas as pd
from bs4 import BeautifulSoup
import requests
import time

# build the list of YYYYMM strings from 2015-01 through 2020-07
dates = []
for year in range(2015, 2021):
    for month in range(1, 13):
        if year == 2020 and month > 7:
            break
        dates.append(f'{year}{month:02d}')

frames = []

for date in dates:
    r = requests.get(f'http://www.bom.gov.au/climate/current/month/aus/archive/{date}.summary.shtml', timeout=10)
    soup = BeautifulSoup(r.content, 'html.parser')

    # the Area-average rainfall table is the third table on each summary page
    result = pd.read_html(str(soup))[2]

    # prepend a YYYY-MM column so every row keeps its month
    result.insert(0, 'dates', f'{date[:4]}-{date[4:]}')
    result.columns = ['dates', 'region', 'rank', 'average', 'departure from mean', 'comment']

    frames.append(result)
    time.sleep(1)  # be polite to the server

full_df = pd.concat(frames, ignore_index=True)
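
Note that the hard-coded index [2] assumes the Area-average rainfall table is always the third table on every summary page. If you would rather not rely on the position, you could match the table by its title text instead. A rough sketch (find_rainfall_table is a hypothetical helper, and I am assuming each table carries a caption element containing the title; check the actual markup first):

# sketch: pick the table by its caption text rather than by position
def find_rainfall_table(soup):
    for table in soup.find_all('table'):
        caption = table.find('caption')
        if caption and 'Area-average rainfall' in caption.get_text():
            return pd.read_html(str(table))[0]
    return None

If that returns a DataFrame, it can replace the pd.read_html(str(soup))[2] line above. Exporting the final result works the same way as in your attempt: full_df.to_csv('output.csv', index=False).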

Upvotes: 2
