Reputation: 164
I've created a script to parse names from a table located in a webpage. The script can scrape the names from the landing page, but I can't scrape the names from the subsequent pages. To produce the results manually on that site, all that is required is to press the Start Search button without changing anything.
This is what I've tried so far:
import requests
from bs4 import BeautifulSoup

link = 'https://hsapps.azdhs.gov/ls/sod/SearchProv.aspx?type=DD'

payload = {
    'ctl00$ContentPlaceHolder1$btnSubmit1': 'Start Search'
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    payload['__VIEWSTATE'] = soup.select_one('#__VIEWSTATE')['value']
    payload['__EVENTVALIDATION'] = soup.select_one('#__EVENTVALIDATION')['value']
    r = s.post(link, data=payload)
    soup = BeautifulSoup(r.text, "lxml")
    for item in soup.select("#ctl00_ContentPlaceHolder1_DgFacils tr:has(> td)"):
        item_name = item.select("td")[1].text
        print(item_name)
How can I keep parsing names from the next pages using requests?
Upvotes: 1
Views: 123
Reputation: 4707
This site uses different URLs for the first page and for all subsequent pages.
I used the Google Chrome developer console to find the form data that must be sent when requesting the pagination pages.
Here is the final Python code:
import requests
from bs4 import BeautifulSoup

link = 'https://hsapps.azdhs.gov/ls/sod/SearchProv.aspx?type=DD'

payload = {
    'ctl00$ContentPlaceHolder1$btnSubmit1': 'Start Search'
}

PAGE_COUNT = 86  # number of result pages shown in the page dropdown

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'

    # Fetch the landing page to collect the ASP.NET state fields,
    # then submit the search form to get the first page of results.
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    payload['__VIEWSTATE'] = soup.select_one('#__VIEWSTATE')['value']
    payload['__EVENTVALIDATION'] = soup.select_one('#__EVENTVALIDATION')['value']
    r = s.post(link, data=payload)
    soup = BeautifulSoup(r.text, "lxml")
    for item in soup.select("#ctl00_ContentPlaceHolder1_DgFacils tr:has(> td)"):
        item_name = item.select("td")[1].text
        print(item_name)

    # All subsequent pages are served from a different URL. Each request
    # replays the state fields from the previous response and selects the
    # target page number in the ddPage dropdown.
    link = "https://hsapps.azdhs.gov/ls/sod/Provider.aspx?type=DD"
    for page in range(2, PAGE_COUNT + 1):
        payload = {
            '__VIEWSTATE': soup.select_one('#__VIEWSTATE')['value'],
            '__EVENTVALIDATION': soup.select_one('#__EVENTVALIDATION')['value'],
            '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$ddPage',
            '__VIEWSTATEENCRYPTED': '',
            '__EVENTARGUMENT': '',
            '__LASTFOCUS': '',
            '__VIEWSTATEGENERATOR': soup.select_one('#__VIEWSTATEGENERATOR')['value'],
            'ctl00$ContentPlaceHolder1$HiddenField1': soup.select_one('#ctl00_ContentPlaceHolder1_HiddenField1')['value'],
            'ctl00$ContentPlaceHolder1$HiddenField2': soup.select_one('#ctl00_ContentPlaceHolder1_HiddenField2')['value'],
            'ctl00$ContentPlaceHolder1$ddPage': str(page),
        }
        r = s.post(link, data=payload)
        soup = BeautifulSoup(r.text, "lxml")
        for item in soup.select("#ctl00_ContentPlaceHolder1_DgFacils tr:has(> td)"):
            item_name = item.select("td")[1].text
            print(item_name)
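As a side note, the hardcoded PAGE_COUNT could instead be derived from the first response, so the script keeps working when the number of pages changes. A minimal sketch, assuming the paginator renders one option element per result page inside the ctl00_ContentPlaceHolder1_ddPage dropdown (the helper name get_page_count is my own, not part of the site or the code above):

```python
from bs4 import BeautifulSoup

def get_page_count(html):
    """Return the number of result pages by counting the entries in the
    page-number dropdown; assumes one <option> per page."""
    soup = BeautifulSoup(html, "html.parser")
    dropdown = soup.select_one('#ctl00_ContentPlaceHolder1_ddPage')
    if dropdown is None:
        # No dropdown rendered: only a single page of results.
        return 1
    return len(dropdown.select('option'))
```

You would call PAGE_COUNT = get_page_count(r.text) right after the first POST, instead of hardcoding 86.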
Upvotes: 3