MITHU

Reputation: 164

Unable to keep parsing names from next pages using requests

I've created a script to parse names from a table located on a webpage. The script can scrape the names from the landing page; what I can't do is scrape the names from the subsequent pages as well.

To produce the results manually on that site, all that is required is to press the Start Search button without changing anything.

What I've tried so far:

import requests
from bs4 import BeautifulSoup

link = 'https://hsapps.azdhs.gov/ls/sod/SearchProv.aspx?type=DD'

payload = {
    'ctl00$ContentPlaceHolder1$btnSubmit1': 'Start Search'
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    # Fetch the landing page to collect the ASP.NET hidden form fields.
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    payload['__VIEWSTATE'] = soup.select_one('#__VIEWSTATE')['value']
    payload['__EVENTVALIDATION'] = soup.select_one('#__EVENTVALIDATION')['value']
    # Simulate pressing the "Start Search" button.
    r = s.post(link, data=payload)
    soup = BeautifulSoup(r.text, "lxml")
    # Print the name column from every data row of the results table.
    for item in soup.select("#ctl00_ContentPlaceHolder1_DgFacils tr:has(> td)"):
        item_name = item.select("td")[1].text
        print(item_name)

How can I keep parsing names from next pages using requests?

Upvotes: 1

Views: 123

Answers (1)

Yevhen Bondar

Reputation: 4707

This site uses a different URL to handle the first page than it does for all subsequent pages.

I used the Google Chrome developer tools to work out the data format needed to request the pagination pages:

  1. Open the webpage https://hsapps.azdhs.gov/ls/sod/SearchProv.aspx?type=DD and press the Start Search button.
  2. Open the Google Chrome developer tools (Network tab).
  3. Click a pagination link.
  4. Inspect the pagination request that appears.
  5. Grab all the POST parameters from that request and find each one in the HTML of the previous page (see the sketch after this list).
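
For step 5, those hidden fields can also be collected generically from the parsed page rather than one by one. Here is a minimal sketch (the helper name hidden_fields is my own, not something the site or the final script below defines):

def hidden_fields(soup):
    # Collect every <input type="hidden"> that the WebForms page renders,
    # e.g. __VIEWSTATE, __VIEWSTATEGENERATOR and __EVENTVALIDATION.
    # 'soup' is a BeautifulSoup document of the previous response.
    return {
        inp['name']: inp.get('value', '')
        for inp in soup.select('input[type=hidden]')
        if inp.has_attr('name')
    }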

Here is the final Python code:


import requests
from bs4 import BeautifulSoup

link = 'https://hsapps.azdhs.gov/ls/sod/SearchProv.aspx?type=DD'

payload = {
    'ctl00$ContentPlaceHolder1$btnSubmit1': 'Start Search'
}

# Total number of result pages (hardcoded here; see the note below).
PAGE_COUNT = 86

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    # First page: collect the hidden WebForms fields, then simulate
    # pressing the "Start Search" button.
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    payload['__VIEWSTATE'] = soup.select_one('#__VIEWSTATE')['value']
    payload['__EVENTVALIDATION'] = soup.select_one('#__EVENTVALIDATION')['value']
    r = s.post(link, data=payload)
    soup = BeautifulSoup(r.text, "lxml")
    for item in soup.select("#ctl00_ContentPlaceHolder1_DgFacils tr:has(> td)"):
        item_name = item.select("td")[1].text
        print(item_name)

    # Subsequent pages are served from a different URL.
    link = "https://hsapps.azdhs.gov/ls/sod/Provider.aspx?type=DD"

    for page in range(2, PAGE_COUNT + 1):
        # Re-send the WebForms state fields from the previous response and
        # select the target page via the pager dropdown (ddPage).
        payload = {
            '__VIEWSTATE': soup.select_one('#__VIEWSTATE')['value'],
            '__EVENTVALIDATION': soup.select_one('#__EVENTVALIDATION')['value'],
            '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$ddPage',
            '__VIEWSTATEENCRYPTED': '',
            '__EVENTARGUMENT': '',
            '__LASTFOCUS': '',
            '__VIEWSTATEGENERATOR': soup.select_one('#__VIEWSTATEGENERATOR')['value'],
            'ctl00$ContentPlaceHolder1$HiddenField1':
                soup.select_one('#ctl00_ContentPlaceHolder1_HiddenField1')['value'],
            'ctl00$ContentPlaceHolder1$HiddenField2':
                soup.select_one('#ctl00_ContentPlaceHolder1_HiddenField2')['value'],
            'ctl00$ContentPlaceHolder1$ddPage': str(page),
        }
        r = s.post(link, data=payload)
        soup = BeautifulSoup(r.text, "lxml")
        for item in soup.select("#ctl00_ContentPlaceHolder1_DgFacils tr:has(> td)"):
            item_name = item.select("td")[1].text
            print(item_name)
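
As a side note, PAGE_COUNT is hardcoded to 86 above. If you'd rather not hardcode it, one option is to derive it from the pager dropdown on the first results page. This is only a sketch, assuming the pager renders as a <select> with the id ctl00_ContentPlaceHolder1_ddPage (the usual WebForms id for the ctl00$ContentPlaceHolder1$ddPage field; verify against the actual HTML):

# Sketch: derive the page count from the pager dropdown instead of
# hardcoding it. Assumes the pager is a <select> with this id; check
# the real markup before relying on it. Run this right after parsing
# the first results page.
page_select = soup.select_one('#ctl00_ContentPlaceHolder1_ddPage')
if page_select is not None:
    PAGE_COUNT = len(page_select.select('option'))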

Upvotes: 3
