Reputation: 65
I'm trying to scrape a weather website, but the data doesn't update properly. The code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.wunderground.com/dashboard/pws/KORPISTO1'
while True:
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    data = soup.find("div", {"class": "weather__text"})
    print(data.text)
I am looking at 'WIND & WIND GUST' in the 'CURRENT CONDITIONS' section. The first values print correctly (for example, 1.0 / 2.2 mph), but after that the printed values update very slowly (sometimes more than 5 minutes pass), even though they change every 10 to 30 seconds on the website.
And when the values do update in Python, they still differ from the current values on the website.
Upvotes: 0
Views: 88
Reputation: 4710
You could try an alternative: since the page actually retrieves its data from another url, you can request that url directly and scrape the page itself only every hour or so to refresh the request url.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import json
from datetime import datetime, timedelta

#def getReqUrl...

reqUrl = getReqUrl()
prevTime, prevAt = '', datetime.now()

while True:
    ures = json.loads(urlopen(reqUrl).read())
    # refresh the request url if it has expired
    if 'observations' not in ures:
        reqUrl = getReqUrl()
        ures = json.loads(urlopen(reqUrl).read())

    #to see time since last update
    obvTime = ures['observations'][0]['obsTimeUtc']
    td = (datetime.now() - prevAt).seconds

    wSpeed = ures['observations'][0]['imperial']['windSpeed']
    wGust = ures['observations'][0]['imperial']['windGust']
    # wind speed / gust, overwriting the current console line
    print('',end=f'\r[+{td}s -> {obvTime}]: {wSpeed} / {wGust} mph')

    # start a fresh line only when a newer observation arrives
    if prevTime < obvTime:
        prevTime = obvTime
        prevAt = datetime.now()
        print('')
Even when making the request directly, the "observation time" in the retrieved data sometimes jumps around, which is why I only start a fresh line when obvTime increases. (If you prefer, you can just print normally without the '',end='\r... format, and then the second if block is no longer necessary either.)
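The "only move on when obvTime increases" rule can be tried out in isolation. This is a minimal sketch rather than part of the answer's code: latest_observations and the sample timestamps are hypothetical, and it assumes obsTimeUtc is an ISO 8601 UTC string (for example '2022-11-20T10:15:30Z'), so that plain string comparison orders the values chronologically.

```python
def latest_observations(obs_times):
    """Yield only timestamps that are strictly newer than every one seen so far."""
    newest = ''  # the empty string sorts before any ISO 8601 timestamp
    for t in obs_times:
        if t > newest:
            newest = t
            yield t


# Hypothetical obsTimeUtc values in which one response briefly "jumps back"
samples = [
    '2022-11-20T10:15:30Z',
    '2022-11-20T10:15:46Z',
    '2022-11-20T10:15:30Z',  # stale response, skipped
    '2022-11-20T10:16:02Z',
]
fresh = list(latest_observations(samples))
print(fresh)
```

If the timestamps ever came in a format that does not sort as text, they would need to be parsed into datetime objects before comparing.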
The first if block is for refreshing the reqUrl (because it expires after a while), which is when I actually scrape the wunderground site, because the url is inside one of their script tags:
def getReqUrl():
    url = 'https://www.wunderground.com/dashboard/pws/KORPISTO1'
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    appText = soup.select_one('#app-root-state').text
    nxtSt = json.loads(appText.replace('&q;','"'))['wu-next-state-key']
    return [
        ns for ns in nxtSt.values()
        if 'observations' in ns['value'] and
        len(ns['value']['observations']) == 1
    ][0]['url'].replace('&a;','&')
or, since I know how the url starts, more simply like:
def getReqUrl():
    url = 'https://www.wunderground.com/dashboard/pws/KORPISTO1'
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    appText = soup.select_one('#app-root-state').text
    rUrl = 'https://api.weather.com/v2/pws/observations/current'
    rUrl = rUrl + appText.split(rUrl)[1].split('&q;')[0]
    return rUrl.replace('&a;','&')
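To see what the string-splitting version does without hitting the site, the same steps can be run on a small made-up fragment. This is only a sketch under assumptions: extract_req_url is a hypothetical helper, and the sample text and its query parameters (apiKey, stationId, format, units) are placeholders, not the real page state. The only details taken from the answer are the url prefix and the '&q;' / '&a;' encodings for quotes and ampersands.

```python
API_PREFIX = 'https://api.weather.com/v2/pws/observations/current'


def extract_req_url(app_text):
    """Return the observations url embedded in the page state, or None if absent."""
    if API_PREFIX not in app_text:
        return None
    # Everything between the prefix and the next encoded quote is the query string
    tail = app_text.split(API_PREFIX)[1].split('&q;')[0]
    return (API_PREFIX + tail).replace('&a;', '&')


# Made-up fragment imitating the encoded state; key and parameters are placeholders
sample = ('{&q;url&q;:&q;https://api.weather.com/v2/pws/observations/current'
          '?apiKey=EXAMPLE&a;stationId=KORPISTO1&a;format=json&a;units=e&q;}')
req_url = extract_req_url(sample)
print(req_url)
```

Returning None when the prefix is missing, instead of raising an IndexError, makes it easier to notice if the page layout changes.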
Upvotes: 1
Reputation: 547
Try:
import requests
from bs4 import BeautifulSoup

url = 'https://www.wunderground.com/dashboard/pws/KORPISTO1'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
session = requests.Session()
r = session.get(url, timeout=30, headers=headers)  # print(r.status_code)
soup = BeautifulSoup(r.content, 'html.parser')

#'WIND & WIND GUST' in 'CURRENT CONDITIONS' section
wind_gust = [
    float(i.text)
    for i in soup.select_one('.weather__header:-soup-contains("WIND & GUST")')
                 .find_next('div', class_='weather__text')
                 .select('span.wu-value-to')
]
print(wind_gust)
# [1.8, 2.2]

wind = wind_gust[0]
gust = wind_gust[1]
print(wind)
# 1.8
print(gust)
# 2.2
Upvotes: 0