MITHU

Reputation: 164

Unable to keep parsing names from next pages using requests

I've created a script to parse names from a table located on a webpage. The script can scrape the names from the landing page; what I can't do is scrape the names from the subsequent pages as well.

To produce the results manually on that site, all that is required is to press the Start Search button without changing anything.

What I've tried so far:

import requests
from bs4 import BeautifulSoup

link = 'https://hsapps.azdhs.gov/ls/sod/SearchProv.aspx?type=DD'

payload = {
    'ctl00$ContentPlaceHolder1$btnSubmit1': 'Start Search'
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    # Fetch the landing page to collect the ASP.NET hidden form fields.
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    payload['__VIEWSTATE'] = soup.select_one('#__VIEWSTATE')['value']
    payload['__EVENTVALIDATION'] = soup.select_one('#__EVENTVALIDATION')['value']
    # Simulate pressing the "Start Search" button.
    r = s.post(link, data=payload)
    soup = BeautifulSoup(r.text, "lxml")
    # Print the name column from every data row of the results table.
    for item in soup.select("#ctl00_ContentPlaceHolder1_DgFacils tr:has(> td)"):
        item_name = item.select("td")[1].text
        print(item_name)

How can I keep parsing names from next pages using requests?

Upvotes: 1

Views: 123

Answers (1)

Yevhen Bondar

Reputation: 4707

This site uses a different URL to handle the first page than it does for all subsequent pages.

I used the Google Chrome developer tools to work out the data format needed to request the pagination pages:

  1. Open the webpage https://hsapps.azdhs.gov/ls/sod/SearchProv.aspx?type=DD and press the Start Search button.
  2. Open the Google Chrome developer tools (Network tab).
  3. Click a pagination link.
  4. Inspect the pagination request that appears.
  5. Grab all the POST parameters from that request and find each one in the HTML of the previous page (see the sketch after this list).
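
For step 5, those hidden fields can also be collected generically from the parsed page rather than one by one. Here is a minimal sketch (the helper name hidden_fields is my own, not something the site or the final script below defines):

def hidden_fields(soup):
    # Collect every <input type="hidden"> that the WebForms page renders,
    # e.g. __VIEWSTATE, __VIEWSTATEGENERATOR and __EVENTVALIDATION.
    # 'soup' is a BeautifulSoup document of the previous response.
    return {
        inp['name']: inp.get('value', '')
        for inp in soup.select('input[type=hidden]')
        if inp.has_attr('name')
    }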

Here is the final Python code:


import requests
from bs4 import BeautifulSoup

link = 'https://hsapps.azdhs.gov/ls/sod/SearchProv.aspx?type=DD'

payload = {
    'ctl00$ContentPlaceHolder1$btnSubmit1': 'Start Search'
}

# Total number of result pages (hardcoded here; see the note below).
PAGE_COUNT = 86

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    # First page: collect the hidden WebForms fields, then simulate
    # pressing the "Start Search" button.
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    payload['__VIEWSTATE'] = soup.select_one('#__VIEWSTATE')['value']
    payload['__EVENTVALIDATION'] = soup.select_one('#__EVENTVALIDATION')['value']
    r = s.post(link, data=payload)
    soup = BeautifulSoup(r.text, "lxml")
    for item in soup.select("#ctl00_ContentPlaceHolder1_DgFacils tr:has(> td)"):
        item_name = item.select("td")[1].text
        print(item_name)

    # Subsequent pages are served from a different URL.
    link = "https://hsapps.azdhs.gov/ls/sod/Provider.aspx?type=DD"

    for page in range(2, PAGE_COUNT + 1):
        # Re-send the WebForms state fields from the previous response and
        # select the target page via the pager dropdown (ddPage).
        payload = {
            '__VIEWSTATE': soup.select_one('#__VIEWSTATE')['value'],
            '__EVENTVALIDATION': soup.select_one('#__EVENTVALIDATION')['value'],
            '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$ddPage',
            '__VIEWSTATEENCRYPTED': '',
            '__EVENTARGUMENT': '',
            '__LASTFOCUS': '',
            '__VIEWSTATEGENERATOR': soup.select_one('#__VIEWSTATEGENERATOR')['value'],
            'ctl00$ContentPlaceHolder1$HiddenField1':
                soup.select_one('#ctl00_ContentPlaceHolder1_HiddenField1')['value'],
            'ctl00$ContentPlaceHolder1$HiddenField2':
                soup.select_one('#ctl00_ContentPlaceHolder1_HiddenField2')['value'],
            'ctl00$ContentPlaceHolder1$ddPage': str(page),
        }
        r = s.post(link, data=payload)
        soup = BeautifulSoup(r.text, "lxml")
        for item in soup.select("#ctl00_ContentPlaceHolder1_DgFacils tr:has(> td)"):
            item_name = item.select("td")[1].text
            print(item_name)
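
As a side note, PAGE_COUNT is hardcoded to 86 above. If you'd rather not hardcode it, one option is to derive it from the pager dropdown on the first results page. This is only a sketch, assuming the pager renders as a <select> with the id ctl00_ContentPlaceHolder1_ddPage (the usual WebForms id for the ctl00$ContentPlaceHolder1$ddPage field; verify against the actual HTML):

# Sketch: derive the page count from the pager dropdown instead of
# hardcoding it. Assumes the pager is a <select> with this id; check
# the real markup before relying on it. Run this right after parsing
# the first results page.
page_select = soup.select_one('#ctl00_ContentPlaceHolder1_ddPage')
if page_select is not None:
    PAGE_COUNT = len(page_select.select('option'))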

Upvotes: 3
