Reputation: 403
The URL to scrape: http://aqicn.org/city/chennai//us-consulate/
I want to obtain the "pm2.5aqi", "temperature", "humidity" and "pressure" data from the website.
The problem: the data I scrape and the data I see when viewing the website's page source are NOT the same.
The code I used to scrape and display the data:
from bs4 import BeautifulSoup
import urllib2
import urllib
import cookielib
url="http://aqicn.org/city/chennai//us-consulate/"
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPRedirectHandler(),
                              urllib2.HTTPHandler(debuglevel=0),
                              urllib2.HTTPSHandler(debuglevel=0),
                              urllib2.HTTPCookieProcessor(cj))
page=opener.open(url)
page_soup=BeautifulSoup(page.read(),'html.parser')
print "curr, max, min pmi2.5 aqi : ",
print page_soup.find('td',id='cur_pm25').string," ",page_soup.find('td',id='max_pm25').string," ",page_soup.find('td',id='min_pm25').string
print "curr, max, min temp : ",
print page_soup.find('td',id='cur_t').span.string," ",page_soup.find('td',id='max_t').span.string," ",page_soup.find('td',id='min_t').span.string
print "curr, max, min pressure : ",
print page_soup.find('td',id='cur_p').string," ",page_soup.find('td',id='max_p').string," ",page_soup.find('td',id='min_p').string
print "curr, max, min humidity : ",
print page_soup.find('td',id='cur_h').string," ",page_soup.find('td',id='max_h').string," ",page_soup.find('td',id='min_h').string
What I was doing: I manually identified the tags in the page source that contain the values, then printed the values of the same tags from the scraped data.
Surprisingly, the data my script displayed and the data present in the page source were different.
My scraped data :
curr, max, min pmi2.5 aqi : 143 157 109
curr, max, min temp : 24 30 24
curr, max, min pressure : 1012 1014 1010
curr, max, min humidity : 100 100 62
The data on the website was (it can be verified from the link, but since it is real-time data the values below may be outdated):
curr, max, min pmi2.5 aqi : 108 166 94
curr, max, min temp : 27 30 24
curr, max, min pressure : 1013 1014 1010
curr, max, min humidity : 83 100 62
I checked the same tags again in the page source, and located the same area in the scraped HTML by making Python display the soup using:
print page_soup.prettify()
But the data was still NOT the same.
How is this possible? Can someone please explain why this weird behaviour occurs, and suggest a workaround or solution for this problem?
Upvotes: 2
Views: 414
Reputation: 9048
The real-time data is rendered by a script, which replaces the default data embedded in the HTML. That default data is what you scraped. I don't know why they include default data at all: it is misleading, because it is always supposed to be replaced. And in the cases where the script fails to replace it, an error message would be better than wrong data.
If you want to scrape this page, look into a web driver such as Selenium to render the page for you, then run the rendered HTML through Beautiful Soup.
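To illustrate why a plain HTTP fetch only ever sees the default values, here is a minimal sketch in Python 3. The HTML string is a made-up simplification, not the site's actual markup: only the id cur_pm25 and the values 143 and 108 come from the question, and the script body is an assumption about how such a replacement could look.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for what the server sends.
# The <td> holds a default value; the script would overwrite it,
# but only inside a browser that actually executes JavaScript.
STATIC_HTML = """
<table><tr><td id="cur_pm25">143</td></tr></table>
<script>
  document.getElementById("cur_pm25").innerHTML = "108";
</script>
"""

# BeautifulSoup parses the markup but never runs the script,
# so the cell still contains the default value.
soup = BeautifulSoup(STATIC_HTML, "html.parser")
static_value = soup.find("td", id="cur_pm25").string
print(static_value)
```

A browser showing the same markup would display 108, which is the kind of mismatch described in the question.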
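A rough sketch of that approach, written for Python 3 with the Selenium 4 bindings and headless Chrome (both are assumptions, as are the helper names fetch_rendered_html and parse_readings, which are made up for illustration). The td ids are the ones used in the question. Note that the wait below only checks that the document has finished loading; it does not prove the script has already replaced the default values, so a longer or more specific wait may be needed.

```python
from bs4 import BeautifulSoup

# Label -> id suffix, matching the ids used in the question (cur_pm25, max_t, ...).
FIELDS = (("pm2.5 aqi", "pm25"), ("temp", "t"), ("pressure", "p"), ("humidity", "h"))


def parse_readings(html):
    """Return {label: (cur, max, min)} from already-rendered HTML."""
    soup = BeautifulSoup(html, "html.parser")
    readings = {}
    for label, suffix in FIELDS:
        row = []
        for prefix in ("cur", "max", "min"):
            cell = soup.find("td", id="%s_%s" % (prefix, suffix))
            # get_text() also covers cells that wrap the value in a <span>.
            row.append(cell.get_text(strip=True) if cell is not None else None)
        readings[label] = tuple(row)
    return readings


def fetch_rendered_html(url, wait_seconds=10):
    """Load the page in a real browser so its scripts run, then return the DOM."""
    # Imported here so parse_readings() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Assumption: waiting for the load to complete is enough; it may not be
        # if the values are filled in by a later asynchronous request.
        WebDriverWait(driver, wait_seconds).until(
            lambda d: d.execute_script("return document.readyState") == "complete"
        )
        return driver.page_source
    finally:
        driver.quit()
```

Usage would then be something like parse_readings(fetch_rendered_html("http://aqicn.org/city/chennai//us-consulate/")), which requires Chrome and a matching driver to be available on the machine.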
Upvotes: 1