kobo

Reputation: 65

Scraping with Beautiful Soup does not update values properly

I am trying to web-scrape a weather website, but the data does not update properly. The code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.wunderground.com/dashboard/pws/KORPISTO1'

while True:
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    data = soup.find("div", {"class": "weather__text"})
    print(data.text)

I am looking at 'WIND & WIND GUST' in the 'CURRENT CONDITIONS' section. It prints the first values correctly (for example 1.0 / 2.2 mph), but after that the values update very slowly (sometimes 5+ minutes pass), even though they change every 10-30 seconds on the website.

And when the values do update in Python, they are still different from the current values on the website.

Upvotes: 0

Views: 88

Answers (2)

Driftr95

Reputation: 4710

You could try this alternative approach: since the site actually retrieves the data from another URL, you can request that URL directly, and only scrape the page itself every hour or so to refresh the request URL.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import json
from datetime import datetime, timedelta
#def getReqUrl...

reqUrl = getReqUrl()
prevTime, prevAt = '', datetime.now()
while True:
  ures = json.loads(urlopen(reqUrl).read())
  if 'observations' not in ures:  # the request url has expired - refresh it
    reqUrl = getReqUrl()
    ures = json.loads(urlopen(reqUrl).read())

  #to see time since last update
  obvTime = ures['observations'][0]['obsTimeUtc']
  td = (datetime.now() - prevAt).seconds 

  wSpeed = ures['observations'][0]['imperial']['windSpeed']
  wGust = ures['observations'][0]['imperial']['windGust']
  print('', end=f'\r[+{td}s -> {obvTime}]:   {wSpeed} mph / {wGust} mph')

  if prevTime < obvTime:
    prevTime = obvTime
    prevAt = datetime.now()
    print('')

Even when making the request directly, the "observation time" in the retrieved data sometimes jumps around, which is why I only print on a fresh line when obvTime increases. (If you prefer, you can just print normally without the '', end='\r... format, and then the second if block is no longer necessary either.)
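For reference, the plain-print version mentioned above is just a one-line change inside the loop (a sketch, using the same variables as the loop above):

  # inside the loop, instead of the '\r' overwrite - one line per poll
  print(f'[+{td}s -> {obvTime}]:   {wSpeed} mph / {wGust} mph')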

The first if block is for refreshing reqUrl (because it expires after a while); that is the only time I actually scrape the wunderground site, because the request url is embedded inside one of their script tags:

def getReqUrl():
  url = 'https://www.wunderground.com/dashboard/pws/KORPISTO1'    
  soup = BeautifulSoup(urlopen(url), 'html.parser')
  appText = soup.select_one('#app-root-state').text

  nxtSt = json.loads(appText.replace('&q;','"'))['wu-next-state-key']  # the embedded JSON uses &q; in place of quotes
  return [
      ns for ns in nxtSt.values() 
      if 'observations' in ns['value'] and 
      len(ns['value']['observations']) == 1
  ][0]['url'].replace('&a;','&')

Or, since I know how the url starts, more simply:

def getReqUrl():
  url = 'https://www.wunderground.com/dashboard/pws/KORPISTO1'    
  soup = BeautifulSoup(urlopen(url), 'html.parser')
  appText = soup.select_one('#app-root-state').text
  
  rUrl = 'https://api.weather.com/v2/pws/observations/current'
  rUrl = rUrl + appText.split(rUrl)[1].split('&q;')[0]
  return rUrl.replace('&a;','&')
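As a small variation (a sketch, not part of the code above), the "every hour or so" refresh can also be done on a fixed schedule with the timedelta import from the first block; the 1-hour interval here is an assumption:

refreshEvery = timedelta(hours=1)   # assumed refresh interval
lastRefresh = datetime.now()
reqUrl = getReqUrl()
while True:
  if datetime.now() - lastRefresh > refreshEvery:
    reqUrl = getReqUrl()            # re-scrape the page for a fresh request url
    lastRefresh = datetime.now()
  ures = json.loads(urlopen(reqUrl).read())
  # ... read observations as in the loop above ...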

Upvotes: 1

Khaled Koubaa

Reputation: 547

Try:

import requests
from bs4 import BeautifulSoup

url = 'https://www.wunderground.com/dashboard/pws/KORPISTO1'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
session = requests.Session()
r = session.get(url, timeout=30, headers=headers)     # print(r.status_code)
soup = BeautifulSoup(r.content, 'html.parser')

# 'WIND & WIND GUST' in 'CURRENT CONDITIONS' section
wind_gust = [float(i.text) for i in
             soup.select_one('.weather__header:-soup-contains("WIND & GUST")')
                 .find_next('div', class_='weather__text')
                 .select('span.wu-value-to')]

print(wind_gust)
# [1.8, 2.2]

wind = wind_gust[0]
gust = wind_gust[1]

print(wind)
# 1.8

print(gust)
# 2.2
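If you need the values to keep updating like in the question's loop, a minimal sketch wraps the same request and selector in a polling loop; the 30-second sleep is an assumption, and the page may still serve briefly cached values:

import time

while True:
    r = session.get(url, timeout=30, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    wind_gust = [float(i.text) for i in
                 soup.select_one('.weather__header:-soup-contains("WIND & GUST")')
                     .find_next('div', class_='weather__text')
                     .select('span.wu-value-to')]
    print(wind_gust)
    time.sleep(30)  # assumed polling interval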

Upvotes: 0
