Paige Blackstone

Reputation: 61

Using Beautiful Soup 4 to scrape weather data (site content is rendered with JavaScript)

I am trying to scrape some weather data from wunderground.com using Beautiful Soup 4. I found a tutorial on how to do this, but it parses the page's static HTML source. Wunderground.com served plain HTML when the tutorial was made, but the page content is now generated by JavaScript.

I obtained the code and adapted it to my specific data retrieval needs, but I am stuck on how to get it to pull the JavaScript-rendered content instead of static HTML. Can anyone help with this?

The code is below; I sourced it from kiengiv of SAS Business Analytics on YouTube.

from bs4 import BeautifulSoup
import urllib3, csv, os, datetime, urllib3.request, re, sys

for vYear in range(2016, 2019):
  for vMonth in range(1, 13):
    for vDay in range(1, 32):
        # go to the next month, if it is a leap year and greater than the 29th or if it is not a leap year
        # and greater than the 28th
        if vYear % 4 == 0:
            if vMonth == 2 and vDay > 29:
                break
        else:
            if vMonth == 2 and vDay > 28:
                break
        # go to the next month, if it is april, june, september or november and greater than the 30th
        if vMonth in [4, 6, 9, 11] and vDay > 30:
            break

        # defining the date string to export and go to the next day using the url
        theDate = str(vYear) + "/" + str(vMonth) + "/" + str(vDay)

        # the new url created after each day
        theurl = "https://www.wunderground.com/history/daily/us/ma/cambridge/KBOS/" + theDate + "date.html"
        # extract the source data for analysis
        http = urllib3.PoolManager()
        thepage = http.request('GET', theurl)
        soup = BeautifulSoup(thepage.data, "html.parser")
        MaxWindSpeed = Visibility = SeaLevelPressure = Precipitation = High_Temp = Low_Temp = Day_Average_Temp = "N/A"
        for temp in soup.find_all('tr'):
            if temp.text.strip().replace('\n', '')[:6] == 'Actual' or temp.text.strip().replace('\n', '')[-6:] == "Record":
                pass
            elif temp.text.replace('\n', '')[-7:] == "RiseSet":
                break
            elif temp.find_all('td')[0].text == "Day Average Temp":
                if temp.find_all('td')[1].text.strip() == "-":
                    Mean = "N/A"
                else:
                    Mean = temp.find_all('td')[1].find(attrs={"<td _ngcontent-c7" : "</td>"}).text
            elif temp.find_all('td')[0].text == "High Temp":
                if temp.find_all('td')[1].text.strip() == "-":
                    Max = "N/A"
                else:
                    Max = temp.find_all('td')[1].find(attrs={"<td _ngcontent-c7" : "</td>"}).text
            elif temp.find_all('td')[0].text == "Low Temp":
                if temp.find_all('td')[1].text.strip() == "-":
                    Min = "N/A"
                else:
                    Min = temp.find_all('td')[1].find(attrs={"<td _ngcontent-c7" : "</td>"}).text
            elif temp.find_all('td')[0].text == "Growing Degree Days":
                if temp.find_all('td')[1].text.strip() == "-":
                    GrowingDegreeDays = "N/A"
                else:
                    GrowingDegreeDays = temp.find_all('td')[1].text
            elif temp.find_all('td')[0].text == "Heating Degree Days":
                if temp.find_all('td')[1].text.strip() == "-":
                    HeatingDegreeDays = "N/A"
                else:
                    HeatingDegreeDays = temp.find_all('td')[1].text
            elif temp.find_all('td')[0].text == "Dew Point":
                if temp.find_all('td')[1].text.strip() == "-" or temp.find_all('td')[1].text.strip() == "":
                    DewPoint = "N/A"
                else:
                    DewPoint = temp.find_all('td')[1].find(attrs={"<td _ngcontent-c7" : "</td>"}).text
            elif temp.find_all('td')[0].text == "Precipitation" and temp.find_all('td')[1].text.strip() != "":
                if temp.find_all('td')[1].text.strip() == "-" or temp.find_all('td')[1].text.strip() == "":
                    Precipitation = "N/A"
                else:
                    Precipitation = temp.find_all('td')[1].find(attrs={"<td _ngcontent-c7" : "</td>"}).text
            elif temp.find_all('td')[0].text == "Sea Level Pressure" and temp.find_all('td')[1].text.strip() != "":
                if temp.find_all('td')[1].text.strip() == "-":
                    SeaLevelPressure = "N/A"
                else:
                    SeaLevelPressure = temp.find_all('td')[1].find(attrs={"<td _ngcontent-c7" : "</td>"}).text
            elif temp.find_all('td')[0].text == "Max Wind Speed":
                if temp.find_all('td')[1].text.strip() == "-" or temp.find_all('td')[1].text.strip() == "":
                    MaxWindSpeed = "N/A"
                else:
                    MaxWindSpeed = temp.find_all('td')[1].find(attrs={"<td _ngcontent-c7" : "</td>"}).text
            elif temp.find_all('td')[0].text == "Visibility":
                if temp.find_all('td')[1].text.strip() == "-":
                    Visibility = "N/A"
                else:
                    Visibility = temp.find_all('td')[1].find(attrs={"<td _ngcontent-c7" : "</td>"}).text
                    break

        # combining the values to be written to the CSV file
        CombinedString = theDate + "," + Mean + "," + Max + "," + Min + "," + HeatingDegreeDays + "," + DewPoint + "," + "," + Precipitation + "," + SeaLevelPressure + "," + MaxWindSpeed + "," + Visibility + "," + Events + "\n"
        file.write(bytes(CombinedString, encoding="ascii", errors='ignore'))

        # printing to help with any debugging and tracking progress
        print(CombinedString)

file.close()

Upvotes: 0

Views: 918

Answers (1)

BlueSheepToken

Reputation: 6109

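Unless you use Selenium, this data cannot be scraped with Beautiful Soup alone, because the page builds its content with JavaScript after it loads. Instead, I found several JSON responses that appear to contain the data you need (I am not sure about this, since I don't know exactly which data you want).

You can find all of the JSON requests in the browser's developer console (F12).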

[Screenshot: developer console listing the JSON requests]

In particular, I found this one (highlighted in the screenshot): https://api.weather.com/v1/geocode/42.36416626/-71.00499725/observations/historical.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&startDate=20160810&endDate=20160810&units=e

You can iterate over dates by changing startDate and endDate. You can also change the location by editing the coordinates after "geocode".
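As a rough sketch of that iteration, the snippet below builds one URL per calendar day using the same URL pattern as the link above. The function name daily_urls and the "YOUR_API_KEY" placeholder are made up for illustration, and it assumes the endpoint keeps accepting startDate/endDate in YYYYMMDD form. Stepping with datetime also avoids the manual month-length and leap-year checks in the question.

```python
from datetime import date, timedelta

# URL pattern copied from the link above; everything else here is illustrative.
BASE = "https://api.weather.com/v1/geocode/{lat}/{lon}/observations/historical.json"


def daily_urls(start, end, lat, lon, api_key):
    """Yield (day, url) for each calendar day from start to end, inclusive."""
    day = start
    while day <= end:
        stamp = day.strftime("%Y%m%d")  # e.g. 20160810
        url = (BASE.format(lat=lat, lon=lon)
               + f"?apiKey={api_key}&startDate={stamp}&endDate={stamp}&units=e")
        yield day, url
        day += timedelta(days=1)  # datetime handles month ends and leap years


# Hypothetical usage: four days spanning the 2016 leap day.
urls = list(daily_urls(date(2016, 2, 27), date(2016, 3, 1),
                       "42.36416626", "-71.00499725", "YOUR_API_KEY"))
for day, url in urls:
    print(day, url)
```

Each URL can then be requested in turn, ideally with a short pause between requests.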

To fetch the JSON, you can use urllib3 together with the standard json library.

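import json
import urllib3

# JSON endpoint found in the developer console (the URL quoted above)
url = ("https://api.weather.com/v1/geocode/42.36416626/-71.00499725/"
       "observations/historical.json?apiKey=6532d6454b8aa370768e63d6ba5a832e"
       "&startDate=20160810&endDate=20160810&units=e")

http = urllib3.PoolManager()
r = http.request(
    'GET',
    url,
    headers={
        'Accept': 'application/json'
    })
# r.data holds the raw response bytes; decode them and parse into Python objects
data = json.loads(r.data.decode('utf-8'))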
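Once the JSON is decoded, daily figures such as the high, low and average temperature can be derived from the individual readings. The sketch below is only an illustration: it assumes the decoded payload has an "observations" list whose entries carry a numeric "temp" field, which you should confirm by inspecting the actual response. The helper name summarize and the sample data are made up.

```python
def summarize(observations, field):
    """Return (min, max, mean) of a numeric field, or None if no values exist."""
    # Skip readings where the field is missing or null.
    values = [o[field] for o in observations if o.get(field) is not None]
    if not values:
        return None
    return min(values), max(values), sum(values) / len(values)


# Hand-made stand-in for the decoded response; the key names are assumptions.
sample = {"observations": [{"temp": 70}, {"temp": 80}, {"temp": None}, {"temp": 75}]}

low, high, mean = summarize(sample["observations"], "temp")
print(low, high, mean)
```

Passing the field name as an argument lets the same helper be reused for other numeric readings, whatever they turn out to be called in the real payload.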

Upvotes: 1
