Tendekai Muchenje

Reputation: 563

How to iterate through multiple results pages when web scraping with Beautiful Soup

I have written a script that uses Beautiful Soup to scrape a website for search results, and I have managed to isolate the data I want via its class name.

However, the search results are not on a single page; they are spread over multiple pages, and I want to get them all. I want the script to check whether a next results page exists and run itself on it as well. Since the number of results varies, I do not know how many pages of results exist, so I can't predefine a range to iterate over. I have also tried an 'if page exists' check, but a page number beyond the range of results still returns a page; it just shows a message saying there are no results to display.

What I have noticed, however, is that every results page has a 'Next' link with id 'NextLink1', and the last results page does not. So I think that's where the magic might be, but I don't know how and where to implement that check; my attempts so far have produced infinite loops. (A sketch of that check follows the script below.)

The script below finds the results for the search term 'x'. Assistance would be greatly appreciated.

from urllib.request import urlopen
from bs4 import BeautifulSoup

#all_letters = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o","p","q","r","s","t","u","v", "w", "x", "y", "z", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
all_letters = ['x']
for letter in all_letters:

    page_number = 1
    url = "https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx?q=nco1%253d2%2526name1%253d" + letter + "&page=" + str (page_number)
    html = urlopen(url)
    bsObj = BeautifulSoup(html)
    nameList = bsObj.findAll("td", {"class":"party-name"})

    for name in nameList:
        print(name.get_text())
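
As a starting point for the 'NextLink1' idea described above, here is a minimal sketch of that stopping check. It assumes, per the description, that every results page except the last contains an element with id 'NextLink1'; the exact id may need adjusting to the page's real markup.

from urllib.request import urlopen
from bs4 import BeautifulSoup

def scrape_letter(letter):
    page_number = 1
    while True:
        url = "https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx?q=nco1%253d2%2526name1%253d" + letter + "&page=" + str(page_number)
        soup = BeautifulSoup(urlopen(url), "html.parser")
        for name in soup.findAll("td", {"class": "party-name"}):
            print(name.get_text())
        # The last results page has no 'Next' link, so its absence
        # ends the loop instead of looping forever.
        if soup.find(id="NextLink1") is None:
            break
        page_number += 1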

Also, does anyone know a shorter way of instantiating a list of alphanumeric characters that's better than the one I commented out in the script above?
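
(As a side note on that question: the standard library's string module already defines those character sets, so the list can be built in one line.)

import string

# Same characters as the commented-out list: a-z followed by 0-9.
all_letters = list(string.ascii_lowercase + string.digits)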

Upvotes: 0

Views: 1833

Answers (1)

SLePort

Reputation: 15461

Try this:

from urllib.request import urlopen
from bs4 import BeautifulSoup


#all_letters = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o","p","q","r","s","t","u","v", "w", "x", "y", "z", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
all_letters = ['x']

def get_url(letter, page_number):
    return "https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx?q=nco1%253d2%2526name1%253d" + letter + "&page=" + str (page_number)

def list_names(soup):
    nameList = soup.findAll("td", {"class":"party-name"})
    for name in nameList:
        print(name.get_text())

def get_soup(letter, page):
    url = get_url(letter, page)
    html = urlopen(url)
    return BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid bs4's parser warning

def main():
    for letter in all_letters:
        bsObj = get_soup(letter, 1)

        # Collect the remaining page numbers from the page-list <select>;
        # the option for the current page carries selected="selected" and is skipped.
        pages = []
        sel = bsObj.find('select', {"name": "ctl00$ctl00$InternetApplication_Body$WebApplication_Body$SearchResultPageList1"})
        if sel is not None:  # a single page of results may have no page list
            for opt in sel.findChildren("option", selected=lambda x: x != "selected"):
                pages.append(opt.string)

        list_names(bsObj)

        for page in pages:
            bsObj = get_soup(letter, page)
            list_names(bsObj)

main()

In the main() function, we fetch the first page with get_soup(letter, 1), find the page-list select element, and store the values of its option children, which hold all the remaining page numbers, in a list.

Next, we loop over those page numbers to extract the data from the following pages.

Upvotes: 1
