Serena

Reputation: 67

BeautifulSoup4 not able to scrape data from this table

Sorry if this is a silly question; I'm new to web scraping and know very little about HTML.

I'm trying to scrape data from this website, specifically from this part (table) of the page:


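末"四"位数 (last four digits): 9775,2275,4775,7275
末"五"位数 (last five digits): 03881,23881,43881,63881,83881,16913,66913
末"六"位数 (last six digits): 313110,563110,813110,063110
末"七"位数 (last seven digits): 4210962,9210962,9785582
末"八"位数 (last eight digits): 63262036
末"九"位数 (last nine digits): 080876872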

I'm sorry it's in Chinese and looks terrible, since I can't embed the picture. The table is roughly in the middle of the page (about 40% of the way down from the top). The id of the row is 'tr_zqh'.

Here is my source code:

import bs4 as bs
import urllib.request

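def scrapezqh(url):
    # Download the raw HTML and parse it with the built-in parser.
    source = urllib.request.urlopen(url).read()
    page = bs.BeautifulSoup(source, 'html.parser')
    # Return the soup so the caller can print or search it.
    return page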

url = 'http://data.eastmoney.com/xg/xg/detail/300741.html?tr_zqh=1'
print(scrapezqh(url))

It scrapes most of the table except the part that I'm interested in. Here is the part of the output where I think the data should be:

<td class="tdcolor">网下有效申购股数(万股)
            </td>
<td class="tdwidth" id="td_wxyxsggs"> 
            </td>
</tr>
<tr id="tr_zqh">
<td class="tdtitle" id="td_zqhrowspan">中签号
            </td>
<td class="tdcolor">中签号公布日期
            </td>
<td class="ltxt" colspan="3"> 2018-02-22 (周四)
            </td>

I'd like to get the content of the row tr id="tr_zqh" (the 6th line of the output above). For some reason its data isn't scraped: there is no content after the date cell. Yet when I check the source code of the webpage, the data is in the table. I don't think it's a dynamic table that BeautifulSoup4 can't handle. I've tried both the lxml and html.parser parsers, as well as pandas.read_html, and they all return the same result. I'd like to understand why the data is missing and how I can fix it. Many thanks!

I forgot to mention that I tried page.find('tr'). It returned part of the table, but not the lines I'm interested in: page.find('tr') returns the 1st line of the screenshot, and I want the data in the 2nd and 3rd lines (highlighted in the screenshot).

Upvotes: 2

Views: 251

Answers (2)

chidimo

Reputation: 2958

Your question isn't entirely clear to me, but here's what I did.

I do a lot of web scraping, so I made a package that gives me a BeautifulSoup object for any webpage. The package is here. My answer depends on it, but you can look at the source code and see that there's nothing esoteric about it. You can pull out the soup-making part and use it as you wish.

Here we go.

pip install pywebber --upgrade

from pywebber import PageRipper

page = PageRipper(url='http://data.eastmoney.com/xg/xg/detail/300741.html?tr_zqh=1', parser='html5lib')

page_soup = page.soup

tr_zqh_table = page_soup.find('tr', id='tr_zqh')

From here you can call find_all('td') on the row:

tr_zqh_table.find_all('td')

Output

[
<td class="tdtitle" id="td_zqhrowspan">中签号
</td>, <td class="tdcolor">中签号公布日期
</td>, <td class="ltxt" colspan="3"> 2018-02-22 (周四)
</td>
]

Going a bit further

for td in tr_zqh_table.find_all('td'):
    print(td.contents)

Output

['中签号\n                ']
['中签号公布日期\n                ']
['\xa02018-02-22 (周四)\n                ']
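The strings above still carry trailing newlines, indentation, and a leading non-breaking space (\xa0). As a minimal sketch, assuming you only want the visible text of each cell, plain string methods are enough to tidy them up. The helper name clean_cell is hypothetical and not part of BeautifulSoup or pywebber.

```python
def clean_cell(text):
    """Collapse runs of whitespace (including non-breaking spaces) and trim the ends."""
    # str.split() with no arguments splits on any Unicode whitespace, \xa0 included.
    return " ".join(text.split())


# Strings copied from the td.contents output above.
raw_cells = [
    "中签号\n                ",
    "中签号公布日期\n                ",
    "\xa02018-02-22 (周四)\n                ",
]

cleaned = [clean_cell(cell) for cell in raw_cells]
print(cleaned)
```

Note that this only cleans the three cells that are present in the downloaded HTML; it does not recover the lottery numbers themselves.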

Upvotes: 0

Dan-Dev

Reputation: 9440

If you extract a couple of variables from the initial page, you can use them to make a request to the API directly. The response contains a JSON object that holds the data.

import requests
import re
import json
from pprint import pprint

s = requests.Session()
r = s.get('http://data.eastmoney.com/xg/xg/detail/300741.html?tr_zqh=1')
# Pull the stock code and the API token out of the page's inline JavaScript.
gpdm = re.search(r"var gpdm = '(.*)'", r.text).group(1)
token = re.search(r'http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get\?type=XGSG_ZQH&token=(.*)&st=', r.text).group(1)

url = "http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get?type=XGSG_ZQH&token=" + token + "&st=LASTFIGURETYPE&sr=1&filter=%28securitycode='" + gpdm + "'%29&js=var%20zqh=%28x%29"
r = s.get(url)
# The body looks like "var zqh=<json>"; skip the 8-character "var zqh=" prefix before decoding.
j = json.loads(r.text[8:])

for row in j:
    print(row['LOTNUM'])


#pprint(j)

Outputs:

9775,2275,4775,7275
03881,23881,43881,63881,83881,16913,66913
313110,563110,813110,063110
4210962,9210962,9785582
63262036
080876872
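Slicing off a fixed 8 characters works only while the response starts with exactly "var zqh=". As a hedged sketch of a slightly more tolerant alternative, the prefix can be matched with a regular expression instead. The function name parse_js_assignment and the sample string are hypothetical; the sample only imitates the "var name=<json>" shape and reuses two LOTNUM values from the output above, not the real API response.

```python
import json
import re


def parse_js_assignment(text):
    """Decode the JSON value from a body shaped like 'var name=<json>'."""
    # Match the 'var name =' prefix, capture the payload, allow an optional trailing ';'.
    match = re.match(r"\s*var\s+\w+\s*=\s*(.*?);?\s*$", text, re.DOTALL)
    if match is None:
        raise ValueError("response does not look like 'var name=<json>'")
    return json.loads(match.group(1))


# Hypothetical sample; only the LOTNUM key and values come from the answer above.
sample = 'var zqh=[{"LOTNUM": "9775,2275,4775,7275"}, {"LOTNUM": "63262036"}]'

rows = parse_js_assignment(sample)
lot_numbers = [row["LOTNUM"] for row in rows]
print(lot_numbers)
```

In the script above you would pass r.text to parse_js_assignment in place of json.loads(r.text[8:]), assuming the API keeps returning that shape.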

Upvotes: 2
