user6810221

Reputation:

Scraping several pages with BeautifulSoup

I'd like to scrape through several pages of a website using Python and BeautifulSoup4. The pages differ by only a single number in their URL, so I could actually make a declaration like this:

theurl = "beginningofurl/" + str(counter) + "/endofurl.html"

The link I've been testing with is this:

And my Python script is this:

import urllib
import urllib.request
from bs4 import BeautifulSoup


def category_crawler():
    ''' This function will crawl through an entire category, regardless of how many pages it consists of. '''

    pager = 1

    while pager < 11:
        theurl = "http://www.worldofquotes.com/topic/Nature/"+str(pager)+"/index.html"
        thepage = urllib.request.urlopen(theurl)
        soup = BeautifulSoup(thepage, "html.parser")

        for link in soup.findAll('blockquote'):
            sanitized = link.find('p').text.strip()
            spantext = link.find('a')
            writer = spantext.find('span').text
            print(sanitized)
            print(writer)
            print('---------------------------------------------------------')


        pager += 1

category_crawler()

So the question is: how do I replace the hardcoded number in the while loop with something that makes the script recognize on its own that it has passed the last page, and then quit automatically?

Upvotes: 1

Views: 1888

Answers (3)

alecxe

Reputation: 473763

The idea is to have an endless loop and break out of it once the "arrow right" element is no longer present on the page, which means you are on the last page. Simple and quite logical:

import requests
from bs4 import BeautifulSoup


page = 1
url = "http://www.worldofquotes.com/topic/Nature/{page}/index.html"
with requests.Session() as session:
    while True:
        response = session.get(url.format(page=page))
        soup = BeautifulSoup(response.content, "html.parser")

        # TODO: parse the page and collect the results

        if soup.find(class_="icon-arrow-right") is None:
            break  # last page

        page += 1
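For completeness, the TODO above can be filled in with the blockquote parsing from the question. This fragment is just a sketch: it reuses the soup object from the loop and keeps the sanitized/writer names from the original script:

        for quote in soup.find_all('blockquote'):
            # extract the quote text and its author, as in the question's script
            sanitized = quote.find('p').text.strip()
            writer = quote.find('a').find('span').text
            print(sanitized)
            print(writer)
            print('---------------------------------------------------------')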

Upvotes: 2

TuanDT

Reputation: 1667

Here is my attempt.

Minor issue: wrap the request in a try-except block in case the redirection leads you somewhere that doesn't exist.

Now, the main issue: how to avoid parsing pages you have already parsed. Keep a record of the URLs you have parsed, then check whether the actual URL urllib is reading from (obtained via the geturl() method of thepage) has already been read. This worked on my Mac OS X machine.

Note: there are 10 pages in total according to what I see on the website, and this method does not require prior knowledge of the page's HTML, so it works in general.

import urllib
import urllib.request
from bs4 import BeautifulSoup


def category_crawler():
    ''' This function will crawl through an entire category, regardless of how many pages it consists of. '''
    urlarchive = []
    pager = 1
    while True:
        theurl = "http://www.worldofquotes.com/topic/Nature/"+str(pager)+"/index.html"
        thepage = None
        try:
            thepage = urllib.request.urlopen(theurl)
            if thepage.geturl() in urlarchive:
                break
            else:
                urlarchive.append(thepage.geturl())
                print(pager)
        except:
            break
        soup = BeautifulSoup(thepage, "html.parser")

        for link in soup.findAll('blockquote'):
            sanitized = link.find('p').text.strip()
            spantext = link.find('a')
            writer = spantext.find('span').text
            print(sanitized)
            print(writer)
            print('---------------------------------------------------------')


        pager += 1

category_crawler()
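As an aside, the bare except above can be narrowed. A minimal sketch, assuming the dead redirect target answers with an HTTP error status such as 404 (the helper name fetch_or_none is hypothetical, not part of the answer):

import urllib.error
import urllib.request


def fetch_or_none(theurl):
    # Return the opened page, or None when the (possibly redirected) URL
    # answers with an HTTP error such as 404, i.e. the signal to stop crawling.
    try:
        return urllib.request.urlopen(theurl)
    except urllib.error.HTTPError:
        return None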

Upvotes: 0

Nuno André

Reputation: 5339

Try it with requests (avoiding redirections) and check whether you still get new quotes.

import requests
from bs4 import BeautifulSoup


def category_crawler():
    ''' This function will crawl through an entire category, regardless of how many pages it consists of. '''

    pager = 1

    while pager < 11:
        theurl = "http://www.worldofquotes.com/topic/Art/"+str(pager)+"/index.html"
        thepage = requests.get(theurl, allow_redirects=False).text
        soup = BeautifulSoup(thepage, "html.parser")

        for link in soup.find_all('blockquote'):
            sanitized = link.find('p').text.strip()
            if not sanitized:
                break
            spantext = link.find('a')
            writer = spantext.find('span').text
            print(sanitized)
            print(writer)
            print('---------------------------------------------------------')


        pager += 1

category_crawler()
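A hypothetical variant of the same idea (not part of the original answer): since redirections are not followed, the redirect response itself can serve as the stop signal, for example via response.is_redirect:

import requests
from bs4 import BeautifulSoup


def crawl_until_redirect(url_template="http://www.worldofquotes.com/topic/Art/{}/index.html"):
    # Sketch only: stop as soon as the site answers with a redirect
    # instead of a real page of quotes.
    pager = 1
    while True:
        response = requests.get(url_template.format(pager), allow_redirects=False)
        if response.is_redirect:
            break  # past the last page
        soup = BeautifulSoup(response.text, "html.parser")
        for quote in soup.find_all('blockquote'):
            print(quote.find('p').text.strip())
        pager += 1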

Upvotes: 0
