Reputation: 61
I am trying to scrape some weather data from wunderground.com using BeautifulSoup 4. I found a tutorial on how to do this, but it shows how to do it against the HTML source code. Wunderground.com served plain HTML when the tutorial was made; the page is now rendered with JavaScript.
I was able to obtain the code and adapt it to my specific data-retrieval needs, but I am stuck on how to get it to pull the JavaScript-rendered data instead of the HTML. Can anyone help with this?
The code is below; I sourced it from kiengiv of SAS Business Analytics on YouTube.
from bs4 import BeautifulSoup
import urllib3, csv, os, datetime, urllib3.request, re, sys

for vYear in range(2016, 2019):
    for vMonth in range(1, 13):
        for vDay in range(1, 32):
            # go to the next month, if it is a leap year and greater than the 29th or if it is not a leap year
            # and greater than the 28th
            if vYear % 4 == 0:
                if vMonth == 2 and vDay > 29:
                    break
            else:
                if vMonth == 2 and vDay > 28:
                    break
            # go to the next month, if it is april, june, september or november and greater than the 30th
            if vMonth in [4, 6, 9, 11] and vDay > 30:
                break
            # defining the date string to export and go to the next day using the url
            theDate = str(vYear) + "/" + str(vMonth) + "/" + str(vDay)
            # the new url created after each day
            theurl = "https://www.wunderground.com/history/daily/us/ma/cambridge/KBOS/" + theDate + "date.html"
            # extract the source data for analysis
            http = urllib3.PoolManager()
            thepage = http.request('GET', theurl)
            soup = BeautifulSoup(thepage, "html.parser")
            MaxWindSpeed = Visibility = SeaLevelPressure = Precipitation = High_Temp = Low_Temp = Day_Average_Temp = "N/A"
            for temp in soup.find_all('tr'):
                if temp.text.strip().replace('\n', '')[:6] == 'Actual' or temp.text.strip().replace('\n', '')[-6:] == "Record":
                    pass
                elif temp.text.replace('\n', '')[-7:] == "RiseSet":
                    break
                elif temp.find_all('td')[0].text == "Day Average Temp":
                    if temp.find_all('td')[1].text.strip() == "-":
                        Mean = "N/A"
                    else:
                        Mean = temp.find_all('td')[1].find(attrs={"<td _ngcontent-c7" : "</td>"}).text
                elif temp.find_all('td')[0].text == "High Temp":
                    if temp.find_all('td')[1].text.strip() == "-":
                        Max = "N/A"
                    else:
                        Max = temp.find_all('td')[1].find(attrs={"<td _ngcontent-c7" : "</td>"}).text
                elif temp.find_all('td')[0].text == "Low Temp":
                    if temp.find_all('td')[1].text.strip() == "-":
                        Min = "N/A"
                    else:
                        Min = temp.find_all('td')[1].find(attrs={"<td _ngcontent-c7" : "</td>"}).text
                elif temp.find_all('td')[0].text == "Growing Degree Days":
                    if temp.find_all('td')[1].text.strip() == "-":
                        GrowingDegreeDays = "N/A"
                    else:
                        GrowingDegreeDays = temp.find_all('td')[1].text
                elif temp.find_all('td')[0].text == "Heating Degree Days":
                    if temp.find_all('td')[1].text.strip() == "-":
                        HeatingDegreeDays = "N/A"
                    else:
                        HeatingDegreeDays = temp.find_all('td')[1].text
                elif temp.find_all('td')[0].text == "Dew Point":
                    if temp.find_all('td')[1].text.strip() == "-" or temp.find_all('td')[1].text.strip() == "":
                        DewPoint = "N/A"
                    else:
                        DewPoint = temp.find_all('td')[1].find(attrs={"<td _ngcontent-c7" : "</td>"}).text
                elif temp.find_all('td')[0].text == "Precipitation" and temp.find_all('td')[1].text.strip() != "":
                    if temp.find_all('td')[1].text.strip() == "-" or temp.find_all('td')[1].text.strip() == "":
                        Precipitation = "N/A"
                    else:
                        Precipitation = temp.find_all('td')[1].find(attrs={"<td _ngcontent-c7" : "</td>"}).text
                elif temp.find_all('td')[0].text == "Sea Level Pressure" and temp.find_all('td')[1].text.strip() != "":
                    if temp.find_all('td')[1].text.strip() == "-":
                        SeaLevelPressure = "N/A"
                    else:
                        SeaLevelPressure = temp.find_all('td')[1].find(attrs={"<td _ngcontent-c7" : "</td>"}).text
                elif temp.find_all('td')[0].text == "Max Wind Speed":
                    if temp.find_all('td')[1].text.strip() == "-" or temp.find_all('td')[1].text.strip() == "":
                        MaxWindSpeed = "N/A"
                    else:
                        MaxWindSpeed = temp.find_all('td')[1].find(attrs={"<td _ngcontent-c7" : "</td>"}).text
                elif temp.find_all('td')[0].text == "Visibility":
                    if temp.find_all('td')[1].text.strip() == "-":
                        Visibility = "N/A"
                    else:
                        Visibility = temp.find_all('td')[1].find(attrs={"<td _ngcontent-c7" : "</td>"}).text
                    break
            # combining the values to be written to the CSV file
            CombinedString = theDate + "," + Mean + "," + Max + "," + Min + "," + HeatingDegreeDays + "," + DewPoint + "," + "," + Precipitation + "," + SeaLevelPressure + "," + MaxWindSpeed + "," + Visibility + "," + Events + "\n"
            file.write(bytes(CombinedString, encoding="ascii", errors='ignore'))
            # printing to help with any debugging and tracking progress
            print(CombinedString)

file.close()
Upvotes: 0
Views: 918
Reputation: 6109
Unless you're using Selenium, the data cannot be scraped with BeautifulSoup. Instead, I found several JSON endpoints which contain the data you need (not sure about this, since I don't know exactly which data you want).
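(As an aside, if you did want to keep the BeautifulSoup approach, a minimal Selenium sketch might look like the one below. It assumes Chrome and a matching chromedriver are installed, and the URL is only illustrative; the JSON approach described next is simpler.)

# Minimal sketch: let Selenium render the JavaScript page, then parse the result
# with BeautifulSoup. Assumes Chrome and a matching chromedriver are installed.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.wunderground.com/history/daily/us/ma/cambridge/KBOS")  # illustrative URL
html = driver.page_source  # HTML after the JavaScript has run; you may still need an explicit wait
driver.quit()

soup = BeautifulSoup(html, "html.parser")
for row in soup.find_all('tr'):
    print(row.text.strip())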
You can find all the JSON requests in the developer console (F12).
In particular, I found this one (highlighted in the picture): https://api.weather.com/v1/geocode/42.36416626/-71.00499725/observations/historical.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&startDate=20160810&endDate=20160810&units=e
You can iterate over it by changing startDate and endDate. You can also change the geolocation after "geocode".
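For example, here is a rough sketch of building one such URL per day over a date range; the coordinates and apiKey are simply copied from the example URL above:

from datetime import date, timedelta

# Sketch: build one historical.json URL per day, using the endpoint found in the
# developer console. Coordinates and apiKey are copied from the example URL above.
base = ("https://api.weather.com/v1/geocode/42.36416626/-71.00499725/"
        "observations/historical.json"
        "?apiKey=6532d6454b8aa370768e63d6ba5a832e"
        "&startDate={d}&endDate={d}&units=e")

day, end = date(2016, 1, 1), date(2016, 1, 5)
while day <= end:
    url = base.format(d=day.strftime("%Y%m%d"))  # the API expects dates as YYYYMMDD
    print(url)
    day += timedelta(days=1)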
To fetch the JSON, you can use urllib3 and the json library.
import urllib3
import json

# 'url' is the historical.json endpoint shown above, with your own
# startDate/endDate (and geocode) filled in
http = urllib3.PoolManager()
r = http.request(
    'GET',
    url,
    headers={
        'Accept': 'application/json'
    })
data = json.loads(r.data.decode('utf-8'))
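The result is then a plain Python dict. I haven't verified the exact layout of the response, so the "observations" key and the field names below are assumptions; check them against what you actually see in the developer console:

# 'data' is the dict returned by json.loads() above. The "observations" key and
# the field names are assumptions -- inspect the real response and adjust them.
for obs in data.get('observations', []):
    print(obs.get('temp'), obs.get('dewpt'), obs.get('wspd'))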
Upvotes: 1