kpaul

Reputation: 1

Scraping Weather Data from multiple pages

I am new to Python.

I want to scrape weather data from the website "http://www.estesparkweather.net/archive_reports.php?date=200901". I need to scrape all the available weather attributes for each day from 2009-01-01 to 2018-10-28, and represent the scraped data as a pandas DataFrame object.

Below are the DataFrame-specific details.

Expected column names (order does not matter):

 ['Average temperature (°F)', 'Average humidity (%)',
 'Average dewpoint (°F)', 'Average barometer (in)',
 'Average windspeed (mph)', 'Average gustspeed (mph)',
 'Average direction (°deg)', 'Rainfall for month (in)',
 'Rainfall for year (in)', 'Maximum rain per minute',
 'Maximum temperature (°F)', 'Minimum temperature (°F)',
 'Maximum humidity (%)', 'Minimum humidity (%)', 'Maximum pressure',
 'Minimum pressure', 'Maximum windspeed (mph)',
 'Maximum gust speed (mph)', 'Maximum heat index (°F)']

Each record in the DataFrame corresponds to the weather details of a given day.
The index column is in date-time format (yyyy-mm-dd).
I need to perform the necessary data cleaning and type-cast each attribute to a relevant data type.

After scraping, I need to save the DataFrame as a pickle file named 'dataframe.pk'.
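As a minimal sketch of that last step (using a tiny hypothetical DataFrame in place of the scraped data), pandas can write and read the pickle directly:

```python
import pandas as pd

# Hypothetical small DataFrame standing in for the scraped data
df = pd.DataFrame(
    {"Average temperature (°F)": [37.8, 35.5]},
    index=pd.to_datetime(["2009-01-01", "2009-01-02"]),
)

# pandas can write and read pickles without importing pickle itself
df.to_pickle("dataframe.pk")
restored = pd.read_pickle("dataframe.pk")
print(restored.equals(df))  # True
```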

Below is the code I was trying initially, just to read the page using BeautifulSoup. But there are multiple pages, month-wise, and I am not sure how to loop over the URLs from January 2009 to October 2018 and get that content into the soup. Can someone help, please?

import bs4
from bs4 import BeautifulSoup
import csv
import requests
import time
import pandas as pd
import urllib
import re
import pickle
import numpy as np

url = "http://www.estesparkweather.net/archive_reports.php?date=200901"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
type(soup)  # bs4.BeautifulSoup

# Get the title
title = soup.title
print(title)

# Print out the text
text = soup.get_text()
print(text)

# Print the first 10 rows for sanity check
rows = soup.find_all('tr')
print(rows[:10])

Upvotes: 0

Views: 1860

Answers (3)

ankur das

Reputation: 1

Below is the one that works for me:

import bs4
from bs4 import BeautifulSoup
import csv
import requests
import time
import pandas as pd
import urllib
import re
import pickle
Dates_r = pd.date_range(start = '01/01/2009', end = '11/01/2018', freq = 'M')
dates = [str(i)[:4] + str(i)[5:7] for i in Dates_r]
dates[0:5]
df_list = []
index = []
for k in range(len(dates)):
    url = "http://www.estesparkweather.net/archive_reports.php?date="
    url += dates[k]
    page = requests.get(url)
    soup =  BeautifulSoup(page.content,'html.parser')
    table = soup.find_all('table')
    raw_data = [row.text.splitlines() for row in table]
    raw_data = raw_data[:-9]
    for i in range(len(raw_data)):
        raw_data[i] = raw_data[i][2:len(raw_data[i]):3]
    for i in range(len(raw_data)):
        c = ['.'.join(re.findall(r"\d+", str(raw_data[i][j].split()[:5]))) for j in range(len(raw_data[i]))]
        if len(c):
            df_list.append(c)
            index.append(dates[k] + c[0])
f_index = [index[i] for i in range(len(index)) if len(index[i]) > 6]
data = [df_list[i][1:] for i in range(len(df_list)) if len(df_list[i][1:]) == 19]
from datetime import datetime
final_index = [datetime.strptime(str(f_index[i]), '%Y%m%d').strftime('%Y-%m-%d') for i in range(len(f_index))]
columns =  ['Average temperature (°F)', 'Average humidity (%)',
 'Average dewpoint (°F)', 'Average barometer (in)',
 'Average windspeed (mph)', 'Average gustspeed (mph)',
 'Average direction (°deg)', 'Rainfall for month (in)',
 'Rainfall for year (in)', 'Maximum rain per minute',
 'Maximum temperature (°F)', 'Minimum temperature (°F)',
 'Maximum humidity (%)', 'Minimum humidity (%)', 'Maximum pressure',
 'Minimum pressure', 'Maximum windspeed (mph)',
 'Maximum gust speed (mph)', 'Maximum heat index (°F)']
# Drop the last three records so the data ends on 2018-10-28
final_index2 = final_index.copy()
data2 = data.copy()
for _ in range(3):
    data2.pop()
    final_index2.pop()
desired_df = pd.DataFrame(data2, index = final_index2)
desired_df.columns = columns
df = desired_df.apply(pd.to_numeric)
df.index = pd.to_datetime(df.index)
import pickle
with open("dataframe.pk", "wb") as file:
    pickle.dump(df, file)
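For reference, the strptime/strftime step in the middle of the block turns the concatenated 'YYYYMMDD' strings into ISO-style dates; a standalone sketch of just that conversion:

```python
from datetime import datetime

raw = "20090101"  # yyyymm from the URL plus the day number parsed from the page
iso = datetime.strptime(raw, "%Y%m%d").strftime("%Y-%m-%d")
print(iso)  # 2009-01-01
```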

Upvotes: 0

shabd89

Reputation: 1

I just tried writing it from scratch based on your initial problem statement, and it worked fine for me:

range_date = pd.date_range(start='1/1/2009', end='11/01/2018', freq='M')
dates = [str(i)[:4] + str(i)[5:7] for i in range_date]

lst = []
index = []

for j in tqdm(range(len(dates))):
    url = "http://www.estesparkweather.net/archive_reports.php?date=" + dates[j]

    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.find_all('table')

    data_parse = [row.text.splitlines() for row in table]
    data_parse = data_parse[:-9]

    for k in range(len(data_parse)):
        data_parse[k] = data_parse[k][2:len(data_parse[k]):3]

    for l in range(len(data_parse)):
        str_l = ['.'.join(re.findall(r"\d+", str(data_parse[l][k].split()[:5]))) for k in range(len(data_parse[l]))]
        if str_l:
            lst.append(str_l)
            index.append(dates[j] + str_l[0])

d1_index = [index[i] for i in range(len(index)) if len(index[i]) > 6]
data = [lst[i][1:] for i in range(len(lst)) if len(lst[i][1:]) == 19]

d2_index = [datetime.strptime(str(d1_index[i]), '%Y%m%d').strftime('%Y-%m-%d') for i in range(len(d1_index))]

desired_df = pd.DataFrame(data, index=d2_index)

This should give you the desired DataFrame, and you can perform any further operations on it.

** You will need to import the required modules (bs4, requests, pandas, re, tqdm, datetime). ** This extracts data from 2009-01-01 to 2018-10-31, so you might need to drop the last 3 records to end at 2018-10-28.
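A minimal sketch of that truncation, assuming the index has already been converted to a DatetimeIndex: a label-based slice is more robust than dropping a fixed number of rows, since `.loc` slicing on dates is end-inclusive.

```python
import pandas as pd

# Hypothetical frame whose index runs past the desired end date
idx = pd.date_range("2018-10-25", "2018-10-31", freq="D")
df = pd.DataFrame({"Average temperature (°F)": range(len(idx))}, index=idx)

# .loc slicing on a DatetimeIndex is end-inclusive,
# so this keeps everything up to and including 2018-10-28
trimmed = df.loc[:"2018-10-28"]
print(trimmed.index.max())  # 2018-10-28 00:00:00
```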

Upvotes: 0

balderman

Reputation: 23825

For reading the information in the time range 2009-01-01 to 2018-10-28, you will have to understand the URL pattern:

http://www.estesparkweather.net/archive_reports.php?date=YYYYMM

Example:

http://www.estesparkweather.net/archive_reports.php?date=201008

So you need to create a nested loop that reads that data for each year/month combination.

Something like:

URL_TEMPLATE = 'http://www.estesparkweather.net/archive_reports.php?date={}{}'
for year in range(2009, 2019):
  for month in range(1, 13):
     url = URL_TEMPLATE.format(year, month)
     # TODO implement the actual scraping of a single page
     # Note that you will need to pad single-digit months with zeros
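As a sketch of the zero-padding the comment mentions, a `{:02d}` format specifier handles single-digit months, and a filter trims the 2018 range to October to match the question:

```python
URL_TEMPLATE = 'http://www.estesparkweather.net/archive_reports.php?date={}{:02d}'

urls = [
    URL_TEMPLATE.format(year, month)
    for year in range(2009, 2019)
    for month in range(1, 13)
    if not (year == 2018 and month > 10)  # stop at October 2018
]

print(urls[0])   # ...date=200901
print(urls[-1])  # ...date=201810
```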

Upvotes: 0
