I am new to Python. I want to scrape weather data from "http://www.estesparkweather.net/archive_reports.php?date=200901". I need to scrape all the available weather attributes for each day from 2009-01-01 to 2018-10-28 and represent the scraped data as a pandas DataFrame object.
Below are the DataFrame-specific details.
Expected column names (order does not matter):
['Average temperature (°F)', 'Average humidity (%)',
'Average dewpoint (°F)', 'Average barometer (in)',
'Average windspeed (mph)', 'Average gustspeed (mph)',
'Average direction (°deg)', 'Rainfall for month (in)',
'Rainfall for year (in)', 'Maximum rain per minute',
'Maximum temperature (°F)', 'Minimum temperature (°F)',
'Maximum humidity (%)', 'Minimum humidity (%)', 'Maximum pressure',
'Minimum pressure', 'Maximum windspeed (mph)',
'Maximum gust speed (mph)', 'Maximum heat index (°F)']
Each record in the dataframe corresponds to the weather details of a given day.
The index column is in date-time format (yyyy-mm-dd).
I need to perform the necessary data cleaning and type cast each attribute to a relevant data type.
After scraping, I need to save the dataframe as a pickle file named 'dataframe.pk'.
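As a minimal illustration of those last steps (numeric casting, a yyyy-mm-dd datetime index, and pickling), here is a sketch with hypothetical sample values in just two of the 19 columns:

```python
import pickle
import pandas as pd

# Hypothetical scraped values, still as strings after parsing
raw = {'Average temperature (°F)': ['37.8', '35.5'],
       'Average humidity (%)': ['35', '43']}
df = pd.DataFrame(raw, index=['2009-01-01', '2009-01-02'])

df = df.apply(pd.to_numeric)         # cast every attribute to a numeric dtype
df.index = pd.to_datetime(df.index)  # string index -> DatetimeIndex

with open('dataframe.pk', 'wb') as f:  # save under the required file name
    pickle.dump(df, f)
```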
Below is the code I was trying initially, just to read the page using BeautifulSoup. But there are multiple pages, one per month, and I am not sure how to loop the URLs from January 2009 to October 2018 and get that content into the soup. Can someone help, please?
import bs4
from bs4 import BeautifulSoup
import csv
import requests
import time
import pandas as pd
import urllib
import re
import pickle
import numpy as np

url = "http://www.estesparkweather.net/archive_reports.php?date=200901"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")

# Get the title
title = soup.title
print(title)

# Print out the text
text = soup.get_text()
print(text)

# Print the first 10 rows for a sanity check
rows = soup.find_all('tr')
print(rows[:10])
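One way to extend the snippet above to every month (a sketch, assuming the site's `date=YYYYMM` query pattern holds across the whole range) is to generate the month strings first and then fetch each page in a loop:

```python
import pandas as pd

# Every month from January 2009 through October 2018, as 'YYYYMM' strings
months = [p.strftime('%Y%m')
          for p in pd.period_range('2009-01', '2018-10', freq='M')]

base = "http://www.estesparkweather.net/archive_reports.php?date="
urls = [base + m for m in months]

# Each url can then be fetched and parsed as in the snippet above, e.g.:
# soup = BeautifulSoup(requests.get(url).content, 'html.parser')
```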
Below is the version that works for me:
import bs4
from bs4 import BeautifulSoup
import csv
import requests
import time
import pandas as pd
import urllib
import re
import pickle
# Month strings '200901' .. '201810'
Dates_r = pd.date_range(start='01/01/2009', end='11/01/2018', freq='M')
dates = [str(i)[:4] + str(i)[5:7] for i in Dates_r]

df_list = []
index = []
for k in range(len(dates)):
    url = "http://www.estesparkweather.net/archive_reports.php?date=" + dates[k]
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.find_all('table')
    raw_data = [row.text.splitlines() for row in table]
    raw_data = raw_data[:-9]          # drop the trailing summary tables
    for i in range(len(raw_data)):
        raw_data[i] = raw_data[i][2:len(raw_data[i]):3]
    for i in range(len(raw_data)):
        c = ['.'.join(re.findall(r"\d+", str(raw_data[i][j].split()[:5])))
             for j in range(len(raw_data[i]))]
        if len(c):
            df_list.append(c)
            index.append(dates[k] + c[0])

# Keep full yyyymmdd indices and only the rows that have all 19 attributes
f_index = [index[i] for i in range(len(index)) if len(index[i]) > 6]
data = [df_list[i][1:] for i in range(len(df_list)) if len(df_list[i][1:]) == 19]

from datetime import datetime
final_index = [datetime.strptime(str(f_index[i]), '%Y%m%d').strftime('%Y-%m-%d')
               for i in range(len(f_index))]

columns = ['Average temperature (°F)', 'Average humidity (%)',
           'Average dewpoint (°F)', 'Average barometer (in)',
           'Average windspeed (mph)', 'Average gustspeed (mph)',
           'Average direction (°deg)', 'Rainfall for month (in)',
           'Rainfall for year (in)', 'Maximum rain per minute',
           'Maximum temperature (°F)', 'Minimum temperature (°F)',
           'Maximum humidity (%)', 'Minimum humidity (%)', 'Maximum pressure',
           'Minimum pressure', 'Maximum windspeed (mph)',
           'Maximum gust speed (mph)', 'Maximum heat index (°F)']

# Drop the last three records (2018-10-29 to 2018-10-31) to end at 2018-10-28
final_index2 = final_index[:-3]
data2 = data[:-3]

desired_df = pd.DataFrame(data2, index=final_index2, columns=columns)
df = desired_df.apply(pd.to_numeric)
df.index = pd.to_datetime(df.index)

with open("dataframe.pk", "wb") as file:
    pickle.dump(df, file)
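To sanity-check the saved file afterwards, a pickle written this way can be read back with `pandas.read_pickle`. A self-contained round-trip sketch, using a small stand-in frame (the real one has 19 weather columns):

```python
import pickle
import pandas as pd

# Stand-in dataframe for the round-trip check
df = pd.DataFrame({'Average temperature (°F)': [37.8]},
                  index=pd.to_datetime(['2009-01-01']))

with open('dataframe.pk', 'wb') as f:
    pickle.dump(df, f)

df_back = pd.read_pickle('dataframe.pk')  # pandas reads plain pickle files too
```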
I just tried writing it from scratch from your initial problem statement, and it worked fine for me:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime
from tqdm import tqdm

range_date = pd.date_range(start='1/1/2009', end='11/01/2018', freq='M')
dates = [str(i)[:4] + str(i)[5:7] for i in range_date]

lst = []
index = []
for j in tqdm(range(len(dates))):
    url = "http://www.estesparkweather.net/archive_reports.php?date=" + dates[j]
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.find_all('table')
    data_parse = [row.text.splitlines() for row in table]
    data_parse = data_parse[:-9]
    for k in range(len(data_parse)):
        data_parse[k] = data_parse[k][2:len(data_parse[k]):3]
    for l in range(len(data_parse)):
        str_l = ['.'.join(re.findall(r"\d+", str(data_parse[l][k].split()[:5])))
                 for k in range(len(data_parse[l]))]
        lst.append(str_l)
        index.append(dates[j] + str_l[0])

d1_index = [index[i] for i in range(len(index)) if len(index[i]) > 6]
data = [lst[i][1:] for i in range(len(lst)) if len(lst[i][1:]) == 19]
d2_index = [datetime.strptime(str(d1_index[i]), '%Y%m%d').strftime('%Y-%m-%d')
            for i in range(len(d1_index))]
desired_df = pd.DataFrame(data, index=d2_index)
This should be your desired dataframe, and you can perform further operations on it.
Note: you will need to import the required modules. This extracts data from 2009-01-01 to 2018-10-31, so you might need to drop the last 3 records to stop at 2018-10-28.
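Once the index is a DatetimeIndex, those trailing days can be dropped by label slicing instead of counting records. A sketch with a hypothetical tail of the scraped frame:

```python
import pandas as pd

# Hypothetical last few rows of the scraped frame
df = pd.DataFrame({'Average temperature (°F)': [40.1, 39.0, 38.2, 37.5]},
                  index=pd.to_datetime(['2018-10-28', '2018-10-29',
                                        '2018-10-30', '2018-10-31']))

# Label slicing on a sorted DatetimeIndex is inclusive of the end label
df = df.loc[:'2018-10-28']
```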
To read the information for the time range 2009-01-01 to 2018-10-28, you first have to understand the URL pattern:
http://www.estesparkweather.net/archive_reports.php?date=YYYYMM
Example:
http://www.estesparkweather.net/archive_reports.php?date=201008
So you need to create a nested loop that reads that data for each year/month combination.
Something like:
URL_TEMPLATE = 'http://www.estesparkweather.net/archive_reports.php?date={}{}'
for year in range(2009, 2019):   # range end is exclusive, so 2019 covers 2018
    for month in range(1, 13):   # months 1..12
        url = URL_TEMPLATE.format(year, month)
        # TODO implement the actual scraping of a single page
        # Note that you will need to pad single digit months with zeros
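The zero-padding the comment mentions can be done directly in the format spec, e.g. with `{:02d}`; a sketch that also stops the loop at October 2018:

```python
URL_TEMPLATE = 'http://www.estesparkweather.net/archive_reports.php?date={:04d}{:02d}'

# Build every monthly URL from 200901 through 201810, months zero-padded
urls = [URL_TEMPLATE.format(year, month)
        for year in range(2009, 2019)
        for month in range(1, 13)
        if (year, month) <= (2018, 10)]  # tuple compare stops at October 2018
```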