BeautifulSoup4 not able to scrape data from this table

Question

Sorry for this silly question as I'm new to web scraping and have no knowledge about HTML etc.

I'm trying to scrape data from this website. Specifically, from this part/table of the page:

末"四"位数 9775,2275,4775,7275 末"五"位数 03881,23881,43881,63881,83881,16913,66913 末"六"位数 313110,563110,813110,063110 末"七"位数 4210962,9210962,9785582 末"八"位数 63262036 末"九"位数 080876872

I'm sorry that's in Chinese and it looks terrible since I can't embed the picture. However, The table is roughly in the middle(40 percentile from the top) of the page. The table id is 'tr_zqh'.

Here is my source code:

import bs4 as bs
import urllib.request

def scrapezqh(url):
    source = urllib.request.urlopen(url).read()
    page = bs.BeautifulSoup(source, 'html.parser')
    print(page)

url = 'http://data.eastmoney.com/xg/xg/detail/300741.html?tr_zqh=1'
print(scrapezqh(url))

It scrapes most of the table but the part that I'm interested in. Here is a part of what it returns where I think the data should be:

网下有效申购股数(万股)
            
 
            


中签号
            
中签号公布日期
            
 2018-02-22 (周四)

I'd like to get the content of this table: tr id="tr_zqh" (the 6th row above). However for some reason it doesn't scrape its data(No content below). However, when I check the source code of the webpage, the data are in the table. I don't think it is a dynamic table which BeautifulSoup4 can't handle. I've tried both lxml and html parser and I've tried pandas.read_html. It returned the same results. I'd like to get some help to understand why it doesn't get the data and how I can fix it. Many thanks!

Forgot to mention that I tried page.find('tr'), it returned a part of the table but not the lines I'm interested. Page.find('tr') returns the 1st line of the screenshot. I want to get the data of the 2nd & 3rd line(highlighted in the screenshot)

Dan-Dev · Accepted Answer

If you extract a couple of variables from the initial page you can use themto make a request to the api directly. Then you get a json object which you can use to get the data.

import requests
import re
import json
from pprint import pprint

s = requests.session()
r = s.get('http://data.eastmoney.com/xg/xg/detail/300741.html?tr_zqh=1')
gdpm = re.search('var gpdm = \'(.*)\'', r.text).group(1)
token  = re.search('http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get\?type=XGSG_ZQH&token=(.*)&st=', r.text).group(1)

url = "http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get?type=XGSG_ZQH&token=" + token + "&st=LASTFIGURETYPE&sr=1&filter=%28securitycode='" + gdpm + "'%29&js=var%20zqh=%28x%29"
r = s.get(url)
j = json.loads(r.text[8:])

for i in range (len(j)):
    print ( j[i]['LOTNUM'])


#pprint(j)

Outputs:

9775,2275,4775,7275
03881,23881,43881,63881,83881,16913,66913
313110,563110,813110,063110
4210962,9210962,9785582
63262036
080876872

BeautifulSoup4 not able to scrape data from this table

Answers (2)

Related Questions