user5563910

Reputation:

Parsing multiple URLs with Python and BeautifulSoup

I started learning Python today, so it is no surprise that I am struggling with some basics. I am trying to parse data from a school website for a project, and I managed to parse the first page. However, there are multiple pages (results are paginated).

I have an idea about how to go about it, i.e., run through the URLs in a loop since I know the URL format, but I have no idea how to proceed. I figured it would be better to somehow search for the "next" button and run the function if it is there, and stop if it is not.

I would appreciate any help I can get.

import requests
from bs4 import BeautifulSoup

url = "http://www.myschoolwebsite.com/1"
#url2 = "http://www.myschoolwebsite.com/2"
r = requests.get(url)

soup = BeautifulSoup(r.content, 'lxml')
g_data = soup.find_all('ul', {"class": "searchResults"})

for item in g_data:
    for li in item.findAll('li'):
        # Result name: the link text inside each h2
        for resultnameh2 in li.findAll('h2'):
            for resultname in resultnameh2.findAll('a'):
                print(resultname.text)
        # Address, with the "Get directions" link text stripped out
        for resultAddress in li.findAll('p', {"class": "resultAddress"}):
            print(resultAddress.text.replace('Get directions', '').strip())
        # Main phone number inside the contact list
        for resultContactList in li.findAll('ul', {"class": "resultContact"}):
            for resultContact in resultContactList.findAll('a', {"class": "resultMainNumber"}):
                print(resultContact.text)

Upvotes: 1

Views: 939

Answers (2)

aash

Reputation: 1323

First, you can assume a maximum number of pages for the directory (if you know the pattern of the URL). I am assuming the URL is of the form http://base_url/page. Then you can write this:

import requests
from bs4 import BeautifulSoup

base_url = 'http://www.myschoolwebsite.com'
total_pages = 100

def parse_content(r):
    soup = BeautifulSoup(r.content, 'lxml')
    g_data = soup.find_all('ul', {"class": "searchResults"})

    for item in g_data:
        for li in item.findAll('li'):
            for resultnameh2 in li.findAll('h2'):
                for resultname in resultnameh2.findAll('a'):
                    print(resultname.text)
            for resultAddress in li.findAll('p', {"class": "resultAddress"}):
                print(resultAddress.text.replace('Get directions', '').strip())
            for resultContactList in li.findAll('ul', {"class": "resultContact"}):
                for resultContact in resultContactList.findAll('a', {"class": "resultMainNumber"}):
                    print(resultContact.text)

# Request each page in turn; stop at the first non-200 response,
# which signals that we have run past the last page.
for page in range(1, total_pages + 1):
    response = requests.get(base_url + '/' + str(page))
    if response.status_code != 200:
        break

    parse_content(response)
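
If you would rather not hard-code a page count, you can stop as soon as the pager no longer offers a "next" button, as the question suggests. A minimal sketch of that idea, assuming the pager is rendered as a link with class next (a hypothetical selector; inspect the real markup to find the right one):

import requests
from bs4 import BeautifulSoup

base_url = 'http://www.myschoolwebsite.com'

page = 1
while True:
    response = requests.get(base_url + '/' + str(page))
    if response.status_code != 200:
        break

    soup = BeautifulSoup(response.content, 'lxml')
    # ... extract names, addresses and numbers as in parse_content() above ...

    # Stop once the page has no "next" link ("a.next" is an assumed selector).
    if soup.find('a', {"class": "next"}) is None:
        break
    page += 1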

Upvotes: 1

isopach

Reputation: 1928

I would make an array with all the URLs and loop through it, or, if there is a clear pattern, write a regex to match that pattern.
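
For example, a minimal sketch of both options (the URL list and the r'/\d+$' pattern are assumptions for illustration; adjust them to the actual site):

import re
import requests
from bs4 import BeautifulSoup

# Option 1: an explicit list of page URLs, looped over one by one.
urls = [
    'http://www.myschoolwebsite.com/1',
    'http://www.myschoolwebsite.com/2',
]
for url in urls:
    soup = BeautifulSoup(requests.get(url).content, 'lxml')
    # ... extract the results as in the question ...

# Option 2: collect pagination links from the first page whose href
# matches a pattern (r'/\d+$' is an assumed pattern, i.e. URLs ending in a number).
first = BeautifulSoup(requests.get(urls[0]).content, 'lxml')
page_links = [a['href'] for a in first.find_all('a', href=re.compile(r'/\d+$'))]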

Upvotes: 1
