ah bon

Reputation: 10011

Get the previous and next page tables from pagination URL in Python

I'm trying to iteratively crawl the tables from each page of this website. With the code below I can extract only one page:

import requests
import json
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

url = 'http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=308894'
website_url = requests.get(url).text
#soup = BeautifulSoup(website_url, 'lxml')
soup = BeautifulSoup(website_url, 'html.parser')
table = soup.find('table', {'class': 'gridview'})
#https://stackoverflow.com/questions/51090632/python-excel-export
df = pd.read_html(str(table))[0]
print(df.head(5)) 

Output:

   序号  ...      竣工备案日期
0   1  ...  2020-01-23
1   2  ...  2020-01-23
2   3  ...  2020-01-23
3   4  ...  2020-01-23
4   5  ...  2020-01-23

[5 rows x 9 columns]

Any ideas on how I could get the content of each page, as if clicking the next-page button on the website? Thank you.


Full code, sending one POST request per page:

import requests
import json
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

dfs = pd.DataFrame()
for page in range(1, 790):
    data = {
      'filter_LIKE_GCMC': '',
      'filter_LIKE_JSDWMC': '',
      'filter_LIKE_SGDWMC': '',
      'filter_LIKE_BABH': '',
      'currentPage': page,
      'pageSize': '15',
      'OrderByField': '',
      'OrderByDesc': ''
    }

    website_url = requests.post('http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=308894', data = data).text
    soup = BeautifulSoup(website_url, 'html.parser')
    table = soup.find('table', {'class': 'gridview'})
    #https://stackoverflow.com/questions/51090632/python-excel-export
    df = pd.read_html(str(table))[0]
    df.columns = df.iloc[0]
    df = df.iloc[1:]
    print(df)
    dfs = pd.concat([df, dfs], sort = False)
print(dfs.columns)
dfs.to_excel('./test.xlsx', index = False)

Output:

Traceback (most recent call last):

  File "<ipython-input-48-9f217cba563e>", line 28, in <module>
    df = pd.read_html(str(table))[0]

  File "/Users/x/anaconda3/lib/python3.6/site-packages/pandas/io/html.py", line 987, in read_html
    displayed_only=displayed_only)

  File "/Users/x/anaconda3/lib/python3.6/site-packages/pandas/io/html.py", line 815, in _parse
    raise_with_traceback(retained)

  File "/Users/x/anaconda3/lib/python3.6/site-packages/pandas/compat/__init__.py", line 404, in raise_with_traceback
    raise exc.with_traceback(traceback)

ValueError: No tables found

Upvotes: 0

Views: 1019

Answers (1)

Sers

Reputation: 12255

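You can request any page by number by sending the form data below with a POST request. Change the currentPage value to move to the next or previous page; the remaining fields carry the filters, page size, and sort order. The total number of results can be read from the page with the td.Normal CSS selector.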

data = {
  'filter_LIKE_GCMC': '',
  'filter_LIKE_JSDWMC': '',
  'filter_LIKE_SGDWMC': '',
  'filter_LIKE_BABH': '',
  'currentPage': '1',
  'pageSize': '15',
  'OrderByField': '',
  'OrderByDesc': ''
}

website_url = requests.post('http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=308894', data=data).text
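To collect several pages, the same POST can be wrapped in a loop. The following is only a sketch under a few assumptions: the helper names build_payload, parse_table, and crawl are made up for illustration, and it assumes every page carries the same gridview table as the first one. It also guards against the ValueError: No tables found shown in the question: when soup.find matches nothing it returns None, str(None) is the text 'None', and pd.read_html raises on that input, so the sketch skips such responses instead of parsing them.

```python
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

URL = 'http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=308894'


def build_payload(page, page_size=15):
    """Form fields from the answer above; only currentPage changes per request."""
    return {
        'filter_LIKE_GCMC': '',
        'filter_LIKE_JSDWMC': '',
        'filter_LIKE_SGDWMC': '',
        'filter_LIKE_BABH': '',
        'currentPage': str(page),
        'pageSize': str(page_size),
        'OrderByField': '',
        'OrderByDesc': '',
    }


def parse_table(html):
    """Return the gridview table as a DataFrame, or None if the page has none."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', {'class': 'gridview'})
    if table is None:
        # Without this check, str(None) would reach read_html and raise
        # "ValueError: No tables found".
        return None
    return pd.read_html(StringIO(str(table)))[0]


def crawl(first_page, last_page):
    """Fetch pages first_page..last_page (inclusive) and stack them in page order."""
    frames = []
    with requests.Session() as session:
        for page in range(first_page, last_page + 1):
            response = session.post(URL, data=build_payload(page), timeout=30)
            response.raise_for_status()
            df = parse_table(response.text)
            if df is None:
                print(f'page {page}: no gridview table in the response, skipping')
                continue
            frames.append(df)
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

With the page range from the question, this could be called as dfs = crawl(1, 789) followed by dfs.to_excel('./test.xlsx', index=False). Appending to a list and concatenating once at the end keeps the rows in page order, whereas pd.concat([df, dfs]) inside the loop places each new page in front of the earlier ones. Any page reported as skipped is worth inspecting by printing response.text, since the sketch does not know why the table was missing.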

Upvotes: 1
