Reputation: 10011
I'm trying to iteratively crawl the tables from each page of this website. With the code below I'm able to extract one page only:
import requests
import json
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
url = 'http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=308894'
website_url = requests.get(url).text
#soup = BeautifulSoup(website_url, 'lxml')
soup = BeautifulSoup(website_url, 'html.parser')
table = soup.find('table', {'class': 'gridview'})
#https://stackoverflow.com/questions/51090632/python-excel-export
df = pd.read_html(str(table))[0]
print(df.head(5))
Output:
序号 ... 竣工备案日期
0 1 ... 2020-01-23
1 2 ... 2020-01-23
2 3 ... 2020-01-23
3 4 ... 2020-01-23
4 5 ... 2020-01-23
[5 rows x 9 columns]
Any ideas on how I could get each page's content by clicking the next page button on the site? Thank you.
Full code:
import requests
import json
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
dfs = pd.DataFrame()
for page in range(1, 790):
    data = {
        'filter_LIKE_GCMC': '',
        'filter_LIKE_JSDWMC': '',
        'filter_LIKE_SGDWMC': '',
        'filter_LIKE_BABH': '',
        'currentPage': page,
        'pageSize': '15',
        'OrderByField': '',
        'OrderByDesc': ''
    }
    website_url = requests.post('http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=308894', data = data).text
    soup = BeautifulSoup(website_url, 'html.parser')
    table = soup.find('table', {'class': 'gridview'})
    #https://stackoverflow.com/questions/51090632/python-excel-export
    df = pd.read_html(str(table))[0]
    df.columns = df.iloc[0]
    df = df.iloc[1:]
    print(df)
    dfs = pd.concat([df, dfs], sort = False)
print(dfs.columns)
dfs.to_excel('./test.xlsx', index = False)
Output:
Traceback (most recent call last):
File "<ipython-input-48-9f217cba563e>", line 28, in <module>
df = pd.read_html(str(table))[0]
File "/Users/x/anaconda3/lib/python3.6/site-packages/pandas/io/html.py", line 987, in read_html
displayed_only=displayed_only)
File "/Users/x/anaconda3/lib/python3.6/site-packages/pandas/io/html.py", line 815, in _parse
raise_with_traceback(retained)
File "/Users/x/anaconda3/lib/python3.6/site-packages/pandas/compat/__init__.py", line 404, in raise_with_traceback
raise exc.with_traceback(traceback)
ValueError: No tables found
Upvotes: 0
Views: 1019
Reputation: 12255
You can use the data payload below to request a page by number, along with the other parameters, using POST. Change the currentPage value to go to the next or previous page. With the td.Normal CSS selector you can get the total number of results.
data = {
    'filter_LIKE_GCMC': '',
    'filter_LIKE_JSDWMC': '',
    'filter_LIKE_SGDWMC': '',
    'filter_LIKE_BABH': '',
    'currentPage': '1',
    'pageSize': '15',
    'OrderByField': '',
    'OrderByDesc': ''
}
website_url = requests.post('http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=308894', data=data).text
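For example, a minimal pagination sketch built on the request above (kept to a few pages for brevity; the exact text inside the td.Normal cell isn't shown here, so the sketch just prints it instead of parsing a page count out of it, and the guard for a missing table is an assumption):
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://bjjs.zjw.beijing.gov.cn/eportal/ui?pageId=308894'

def fetch_page(page):
    # POST the form parameters; only currentPage changes between requests
    data = {
        'filter_LIKE_GCMC': '',
        'filter_LIKE_JSDWMC': '',
        'filter_LIKE_SGDWMC': '',
        'filter_LIKE_BABH': '',
        'currentPage': str(page),
        'pageSize': '15',
        'OrderByField': '',
        'OrderByDesc': ''
    }
    html = requests.post(url, data=data).text
    return BeautifulSoup(html, 'html.parser')

soup = fetch_page(1)
# td.Normal holds the result summary; inspect its text to find the total count
total_info = soup.select_one('td.Normal')
if total_info is not None:
    print(total_info.get_text(strip=True))

frames = []
for page in range(1, 4):  # widen the range once you know the total page count
    soup = fetch_page(page)
    table = soup.find('table', {'class': 'gridview'})
    if table is None:  # skip responses that come back without the table
        continue
    df = pd.read_html(str(table))[0]
    df.columns = df.iloc[0]   # first row holds the headers, as in the question
    frames.append(df.iloc[1:])

result = pd.concat(frames, ignore_index=True, sort=False)
result.to_excel('./test.xlsx', index=False)
Collecting the per-page frames in a list and concatenating once at the end also avoids the repeated pd.concat calls inside the loop.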
Upvotes: 1