Reputation: 13
I am pretty new to HTML and web scraping. I've been trying to scrape the table elements from the following link:
What I want to do is extract elements such as "Total turnover", "Total market capitalisation", etc. When I inspect the page, all of these elements lie in <div class="table-container fixed-freeze-tb-parent" id="Tbl__0">.
What puzzled me was that after I created the BeautifulSoup object and saved the page to a text file using
import requests
import bs4

turn180329 = requests.get('https://www.hkex.com.hk/Mutual-Market/Stock-Connect/Statistics/Hong-Kong-and-Mainland-Market-Highlights?sc_lang=en#select3=0&select2=2&select1=28')
turnsoup = bs4.BeautifulSoup(turn180329.text, 'lxml')

# save the raw page for inspection
file180329 = open('180329.txt', 'wb')
for chunk in turn180329.iter_content(1000000):
    file180329.write(chunk)
file180329.close()
I could select div[class="table-container fixed-freeze-tb-parent"] and it returned the div element, but selecting by id="Tbl__0" returned nothing when I used

turn_table = turnsoup.find_all('#tbl__0.table-container fixed-freeze-tb-parent')

to extract the desired table elements.
A million thanks in advance to anyone who can help!
Upvotes: 1
Views: 221
Reputation:
As @Juan Javier Santos Ochoa says, the browser actually sends another request to a different URL, and the server responds with JSON data. Here's the code to complement his answer.
The date part (TDD=29, TMM=3, TYYYY=2018) in this URL can be modified to get results for a different day:
url = 'https://www.hkex.com.hk/eng/csm/ws/Highlightsearch.asmx/GetData?LangCode=en&TDD=29&TMM=3&TYYYY=2018'
Thanks to @Keyur Potdar for pointing out that headers need not be sent.
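As a minimal sketch (not part of the original answer), you can also let requests build the query string for an arbitrary date instead of hard-coding it; the fetch_highlights helper name here is just for illustration:

import datetime
import requests

def fetch_highlights(day):
    # Build the same GetData URL for any date by passing the query parameters to requests.
    base = 'https://www.hkex.com.hk/eng/csm/ws/Highlightsearch.asmx/GetData'
    params = {'LangCode': 'en', 'TDD': day.day, 'TMM': day.month, 'TYYYY': day.year}
    return requests.get(base, params=params).json()

d = fetch_highlights(datetime.date(2018, 3, 29))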
Here's the line that sends the request and fetches the JSON:
r = requests.get(url)
d = r.json()
And here's the result:
# Turnover (Mil. shares) - Main Board, GEM
>>> print(d['data'][9]['td'][1])
['232,780', '1,769']
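If you're not sure which index holds the row you need, a quick loop over d['data'] (assuming, as above, that each entry carries a 'td' list) lets you inspect the rows and find the right one:

# Print every row's index and its 'td' contents to locate the figure you want.
for i, row in enumerate(d['data']):
    print(i, row.get('td'))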
Upvotes: 1
Reputation: 71461
It appears that the site is dynamic, i.e. a front-end script updates the DOM with values from the backend when the page is opened in the browser. To scrape a dynamic site, you will need to use a browser automation tool such as selenium:
from bs4 import BeautifulSoup as soup
from selenium import webdriver

d = webdriver.Chrome()
d.get('https://www.hkex.com.hk/Mutual-Market/Stock-Connect/Statistics/Hong-Kong-and-Mainland-Market-Highlights?sc_lang=en#select3=0&select2=2&select1=28')
# Parse the rendered page source and pull the cell text from every table row,
# dropping the header and footer rows.
new_data = [[b.text for b in i.find_all('td')] for i in soup(d.page_source, 'lxml').find_all('tr')][2:-2]
Output:
[[u'No. of listed companies', u'1,827', u'352', u'1,406', u'51', u'2,096', u'49'], [u'No. of listed H shares', u'230', u'24', u'n.a.', u'n.a.', u'n.a.', u'n.a.'], [u'No. of listed red-chips stocks', u'158', u'6', u'n.a.', u'n.a.', u'n.a.', u'n.a.'], [u'Total no. of listed securities', u'13,527', u'353', u'n.a.', u'n.a.', u'n.a.', u'n.a.'], [u'Total market capitalisation(Bil. dollars)', u'HKD 34,139', u'HKD 264', u'RMB 32,376', u'RMB 91', u'RMB 23,008', u'RMB 74'], [u'Total negotiable capitalisation (Bil. dollars)', u'n.a.', u'n.a.', u'RMB 27,500', u'RMB 91', u'RMB 16,580', u'RMB 73'], [u'Average P/E ratio (Times)', u'12.42', u'42.23', u'17.72', u'21.42', u'32.89', u'11.06'], [u'Total turnover (Mil. shares)', u'232,780', u'1,769', u'17,027', u'22', u'19,244', u'13'], [u'Total turnover (Mil. dollars)', u'HKD 137,287', u'HKD 821', u'RMB 210,972', u'RMB 158', u'RMB 268,838', u'RMB 72'], [u'Total market turnover(Mil. dollars)', u'HKD 138,108', u'RMB 213,189', u'RMB 268,910']]
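Because the values are filled in by a script after the page loads, it may also help to wait for the table to render before reading page_source. A minimal sketch, assuming the container keeps the id Tbl__0 mentioned in the question:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Block for up to 10 seconds until the stats table is present in the DOM.
WebDriverWait(d, 10).until(EC.presence_of_element_located((By.ID, 'Tbl__0')))
html = d.page_source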
Upvotes: 0
Reputation: 1758
That is because the data in the table is not in the HTML source when you make the request. You can use your browser's developer tools to inspect the requests the website makes. In this case, I can see that the website requests the data from this URL: https://www.hkex.com.hk/eng/csm/ws/Highlightsearch.asmx/GetData?LangCode=en&TDD=29&TMM=3&TYYYY=2018&_=1522759817885
This returns the table data in JSON format.
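A minimal sketch of fetching it with requests (the trailing _= timestamp parameter appears to be optional, since the URL in the first answer omits it):

import requests

url = 'https://www.hkex.com.hk/eng/csm/ws/Highlightsearch.asmx/GetData?LangCode=en&TDD=29&TMM=3&TYYYY=2018'
data = requests.get(url).json()  # the table rows come back as JSON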
Upvotes: 1