Reputation: 13
I am pretty new to HTML and web scraping. I've been trying to scrape the table elements from the following link:
What I want to do is extract elements such as "Total turnover", "Total market capitalisation", etc. When I inspect the page, all of these elements lie in <div class="table-container fixed-freeze-tb-parent" id="Tbl__0">.
What puzzled me was that after I created the BeautifulSoup object and saved the page to a text file using
import requests
import bs4

turn180329 = requests.get('https://www.hkex.com.hk/Mutual-Market/Stock-Connect/Statistics/Hong-Kong-and-Mainland-Market-Highlights?sc_lang=en#select3=0&select2=2&select1=28')
turnsoup = bs4.BeautifulSoup(turn180329.text, 'lxml')

# save the raw page for inspection
file180329 = open('180329.txt', 'wb')
for chunk in turn180329.iter_content(1000000):
    file180329.write(chunk)
file180329.close()
I could select div[class="table-container fixed-freeze-tb-parent"] and it returned the div element, but selecting by id="Tbl__0" returned nothing when I used

turn_table = turnsoup.find_all('#tbl__0.table-container fixed-freeze-tb-parent')

to extract the desired table elements.
A million thanks in advance to anyone who can help!
Upvotes: 1
Views: 221
Reputation:
As @Juan Javier Santos Ochoa says, the browser actually sends another request to a different URL, and the server responds with JSON data. Here's the code to complement his answer.
The date part (TDD=29, TMM=3, TYYYY=2018) in this URL can be modified to get results for a different day:
url = 'https://www.hkex.com.hk/eng/csm/ws/Highlightsearch.asmx/GetData?LangCode=en&TDD=29&TMM=3&TYYYY=2018'
Thanks to @Keyur Potdar for pointing out that headers need not be sent.
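As a minimal sketch (not part of the original answer), you can also let requests build the query string for an arbitrary date instead of hard-coding it; the fetch_highlights helper name here is just for illustration:

import datetime
import requests

def fetch_highlights(day):
    # Build the same GetData URL for any date by passing the query parameters to requests.
    base = 'https://www.hkex.com.hk/eng/csm/ws/Highlightsearch.asmx/GetData'
    params = {'LangCode': 'en', 'TDD': day.day, 'TMM': day.month, 'TYYYY': day.year}
    return requests.get(base, params=params).json()

d = fetch_highlights(datetime.date(2018, 3, 29))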
Here's the line that sends the request and fetches the JSON:
r = requests.get(url)
d = r.json()
And here's the result:
# Turnover (Mil. shares) - Main Board, GEM
>>> print(d['data'][9]['td'][1])
['232,780', '1,769']
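If you're not sure which index holds the row you need, a quick loop over d['data'] (assuming, as above, that each entry carries a 'td' list) lets you inspect the rows and find the right one:

# Print every row's index and its 'td' contents to locate the figure you want.
for i, row in enumerate(d['data']):
    print(i, row.get('td'))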
Upvotes: 1
Reputation: 71461
It appears that the site is dynamic, i.e. a front-end script updates the DOM with values from the backend when the page is opened in the browser. To scrape a dynamic site, you will need to use a browser automation tool such as selenium:
from bs4 import BeautifulSoup as soup
from selenium import webdriver

d = webdriver.Chrome()
d.get('https://www.hkex.com.hk/Mutual-Market/Stock-Connect/Statistics/Hong-Kong-and-Mainland-Market-Highlights?sc_lang=en#select3=0&select2=2&select1=28')
# Parse the rendered page source and pull the cell text from every table row,
# dropping the header and footer rows.
new_data = [[b.text for b in i.find_all('td')] for i in soup(d.page_source, 'lxml').find_all('tr')][2:-2]
Output:
[[u'No. of listed companies', u'1,827', u'352', u'1,406', u'51', u'2,096', u'49'], [u'No. of listed H shares', u'230', u'24', u'n.a.', u'n.a.', u'n.a.', u'n.a.'], [u'No. of listed red-chips stocks', u'158', u'6', u'n.a.', u'n.a.', u'n.a.', u'n.a.'], [u'Total no. of listed securities', u'13,527', u'353', u'n.a.', u'n.a.', u'n.a.', u'n.a.'], [u'Total market capitalisation(Bil. dollars)', u'HKD 34,139', u'HKD 264', u'RMB 32,376', u'RMB 91', u'RMB 23,008', u'RMB 74'], [u'Total negotiable capitalisation (Bil. dollars)', u'n.a.', u'n.a.', u'RMB 27,500', u'RMB 91', u'RMB 16,580', u'RMB 73'], [u'Average P/E ratio (Times)', u'12.42', u'42.23', u'17.72', u'21.42', u'32.89', u'11.06'], [u'Total turnover (Mil. shares)', u'232,780', u'1,769', u'17,027', u'22', u'19,244', u'13'], [u'Total turnover (Mil. dollars)', u'HKD 137,287', u'HKD 821', u'RMB 210,972', u'RMB 158', u'RMB 268,838', u'RMB 72'], [u'Total market turnover(Mil. dollars)', u'HKD 138,108', u'RMB 213,189', u'RMB 268,910']]
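Because the values are filled in by a script after the page loads, it may also help to wait for the table to render before reading page_source. A minimal sketch, assuming the container keeps the id Tbl__0 mentioned in the question:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Block for up to 10 seconds until the stats table is present in the DOM.
WebDriverWait(d, 10).until(EC.presence_of_element_located((By.ID, 'Tbl__0')))
html = d.page_source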
Upvotes: 0
Reputation: 1758
That is because the data in the table is not in the HTML source when you make the request. You can use your browser's developer tools to inspect the requests the website makes. In this case, I can see that the website requests the data from this URL: https://www.hkex.com.hk/eng/csm/ws/Highlightsearch.asmx/GetData?LangCode=en&TDD=29&TMM=3&TYYYY=2018&_=1522759817885
This returns the table data in JSON format.
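A minimal sketch of fetching it with requests (the trailing _= timestamp parameter appears to be optional, since the URL in the first answer omits it):

import requests

url = 'https://www.hkex.com.hk/eng/csm/ws/Highlightsearch.asmx/GetData?LangCode=en&TDD=29&TMM=3&TYYYY=2018'
data = requests.get(url).json()  # the table rows come back as JSON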
Upvotes: 1