Reputation: 71
I am trying to scrape historical weather data from this website: http://www.hko.gov.hk/cis/dailyExtract_uc.htm?y=2016&m=1
After reading up on the AJAX call the page makes, I found that the data can be requested with the following code:
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
headers = {
'Accept': 'text/plain, */*; q=0.01',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-US,en;q=0.9',
'Connection': 'keep-alive',
'Host': 'www.hko.gov.hk',
'Referer': 'http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2016&m=3',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
'X-Requested-With': 'XMLHttpRequest'
}
with requests.Session() as s:
    # request April 2015 weather data
    r = s.get(r"http://www.hko.gov.hk/cis/dailyExtract/dailyExtract_201504.xml", verify=False, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    data = json.loads(soup.get_text())['stn']['data'][0]['dayData'][:-2]
    df = pd.DataFrame(data)
However, the data I retrieved does not contain the three columns on the right-hand side of the table. What did I miss in the GET request?
Upvotes: 1
Reputation: 28620
Fix the request URL. Change:
http://www.hko.gov.hk/cis/dailyExtract/dailyExtract_201504.xml
to
http://www.hko.gov.hk/cis/dailyExtract/dailyExtract_2015.xml
Then you can grab the 4th element (index 3, i.e. April), or whichever month you need, from the list data['stn']['data']:
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
headers = {
'Accept': 'text/plain, */*; q=0.01',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-US,en;q=0.9',
'Connection': 'keep-alive',
'Host': 'www.hko.gov.hk',
'Referer': 'http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2016&m=3',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
'X-Requested-With': 'XMLHttpRequest'
}
with requests.Session() as s:
    # request the full 2015 data, then take April (index 3)
    data = s.get(r"http://www.hko.gov.hk/cis/dailyExtract/dailyExtract_2015.xml", verify=False, headers=headers).json()
    df = pd.DataFrame(data['stn']['data'][3]['dayData'])
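The month lookup can also be wrapped in a small helper so the zero-based index is not hard-coded. This is only a sketch: the name month_rows and its drop_trailing parameter are hypothetical, and the sample payload is made up to mimic the ['stn']['data'][i]['dayData'] nesting used above. It is not real HKO data, and the "Mean"/"Normal" rows are an assumption about what the question's [:-2] slice was discarding.

```python
def month_rows(payload, month, drop_trailing=0):
    """Return the dayData rows for a 1-based month from a yearly payload.

    Assumes payload['stn']['data'] is a list with one entry per month,
    in calendar order, as the answer's index 3 for April suggests.
    """
    months = payload["stn"]["data"]
    if not 1 <= month <= len(months):
        raise ValueError(f"month {month} not present in payload")
    rows = months[month - 1]["dayData"]
    # Optionally drop trailing rows, mirroring the [:-2] slice in the question.
    if drop_trailing:
        rows = rows[:-drop_trailing]
    return rows


# Made-up payload with the same nesting as the yearly response (not real data).
sample = {
    "stn": {
        "data": [
            {
                "dayData": [
                    [f"{m:02d}-01", "x"],
                    [f"{m:02d}-02", "y"],
                    ["Mean", "-"],
                    ["Normal", "-"],
                ]
            }
            for m in range(1, 13)
        ]
    }
}

april = month_rows(sample, 4, drop_trailing=2)
print(april)
```

With the real response, the returned rows could be passed to pd.DataFrame exactly as in the snippet above.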
Upvotes: 0
Reputation: 84465
It seems that if you request the entire year and then extract the month, the missing columns are there:
import requests
import json
with requests.Session() as s:
    r = s.get(r"http://www.hko.gov.hk/cis/dailyExtract/dailyExtract_2015.xml", headers={'User-Agent': 'Mozilla/5.0'}).json()
    print(r['stn']['data'][3]['dayData'][0])
Upvotes: 2
Reputation: 71
Sorry guys, I have solved the issue, and this was a stupid question. It turns out the older data comes from a different source than the recent data, and I got confused by the format.
Upvotes: 0