Scraped data and real data are not the same (Python)

Question

The URL to scrape: http://aqicn.org/city/chennai//us-consulate/
I want to obtain the "pm2.5aqi", "temperature", "humidity" and "pressure" data from the website.

The problem: the data I scrape and the data I see in the source of the website are NOT the same.

The code I used to scrape and display the data:

from bs4 import BeautifulSoup
import urllib2
import urllib
import cookielib

url="http://aqicn.org/city/chennai//us-consulate/"
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPRedirectHandler(),
        urllib2.HTTPHandler(debuglevel=0),
        urllib2.HTTPSHandler(debuglevel=0),
        urllib2.HTTPCookieProcessor(cj))
page=opener.open(url)
page_soup=BeautifulSoup(page.read(),'html.parser')

print "curr, max, min pmi2.5 aqi : ",
print page_soup.find('td',id='cur_pm25').string,"     ",page_soup.find('td',id='max_pm25').string,"  ",page_soup.find('td',id='min_pm25').string

print "curr, max, min temp : ",
print page_soup.find('td',id='cur_t').span.string,"  ",page_soup.find('td',id='max_t').span.string,"  ",page_soup.find('td',id='min_t').span.string

print "curr, max, min pressure : ",
print page_soup.find('td',id='cur_p').string,"  ",page_soup.find('td',id='max_p').string,"  ",page_soup.find('td',id='min_p').string

print "curr, max, min humidity : ",
print page_soup.find('td',id='cur_h').string,"  ",page_soup.find('td',id='max_h').string,"  ",page_soup.find('td',id='min_h').string

What I was doing: I manually identified the tags in the page source that contain the values, then printed the values of the same tags from the scraped data.

Surprisingly, the data displayed and the data present in the page's source were different.

My scraped data:

curr, max, min pmi2.5 aqi :  143    157    109
curr, max, min temp :  24    30    24
curr, max, min pressure :  1012    1014    1010
curr, max, min humidity :  100    100    62

The data on the website was: (the values can be verified from the link, but they may be outdated by now, since the data is real-time)

curr, max, min pmi2.5 aqi : 108   166   94
curr, max, min temp : 27   30   24
curr, max, min pressure : 1013   1014   1010
curr, max, min humidity : 83   100   62

I checked the same tags again in the page source, and located the same area in the scraped data by making Python display the soup with:

print page_soup.prettify()

But the data was still NOT the same.
How is this possible? Can someone please explain why this weird behaviour occurs, and suggest a workaround or solution?

dstudeba · Accepted Answer

The real-time data is rendered by a script, which replaces the default data embedded in the page; that default data is what you scraped. I don't know why they put default data in, because it is misleading: it should always be replaced, and in the cases where it isn't, an error message would be better than wrong data.
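
To illustrate the effect, here is a minimal sketch using a made-up page. The HTML and the script in it are invented for illustration and are not copied from aqicn.org; only the id and the two numbers are borrowed from the question. Beautiful Soup parses the markup as it was delivered and never executes the script, so it reports the placeholder. The sketch is written for Python 3, unlike the Python 2 code in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the <td> ships with a placeholder value, and a script
# overwrites it in the browser once the page has loaded.
STATIC_HTML = """
<html><body>
<table><tr><td id="cur_pm25">143</td></tr></table>
<script>
  document.getElementById('cur_pm25').innerHTML = '108';
</script>
</body></html>
"""


def static_value(html, element_id):
    """Return the text of the <td> with the given id, as found in the raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    cell = soup.find("td", id=element_id)
    return cell.get_text(strip=True) if cell is not None else None


# Prints the placeholder (143), not the value the script would write (108).
print(static_value(STATIC_HTML, "cur_pm25"))
```

A browser would show 108 for the same page, which matches the kind of mismatch described in the question.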

If you want to scrape this, look into a web driver such as Selenium to render the page for you, then run the rendered HTML through Beautiful Soup.
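
A rough sketch of that approach follows, written for Python 3 and assuming Selenium plus Firefox and its driver are installed. The helper names read_cells and rendered_html are hypothetical, and the choice of Firefox and the 10-second wait are assumptions rather than requirements. Whether waiting for the cell to exist is enough for this particular site has not been verified, so treat the wait condition as a starting point.

```python
from bs4 import BeautifulSoup

URL = "http://aqicn.org/city/chennai//us-consulate/"


def read_cells(html, ids):
    """Map each id to the text of its <td>, or None if the cell is absent."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for element_id in ids:
        cell = soup.find("td", id=element_id)
        result[element_id] = cell.get_text(strip=True) if cell is not None else None
    return result


def rendered_html(url, wait_seconds=10):
    """Load url in a real browser and return the HTML after scripts have run."""
    # Imported here so read_cells stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # assumption: Firefox and its driver are available
    try:
        driver.get(url)
        # Wait until the cell exists. This alone does not prove the script has
        # already replaced the placeholder, so a stricter condition may be needed.
        WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_element_located((By.ID, "cur_pm25"))
        )
        return driver.page_source
    finally:
        driver.quit()


# Example use (opens a browser window, so it is left commented out):
# print(read_cells(rendered_html(URL), ["cur_pm25", "max_pm25", "min_pm25"]))
```

Keeping the parsing in read_cells means the same function can be pointed at either the raw response or the rendered page, which makes it easy to compare the two.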
