I'd like to scrape through several pages of a website using Python and BeautifulSoup4. The pages differ by only a single number in their URL, so I could actually make a declaration like this:
theurl = "beginningofurl/" + str(counter) + "/endofurl.html"
The link I've been testing with is the Nature topic URL used in the script below.
And my Python script is this:
import urllib
import urllib.request
from bs4 import BeautifulSoup

def category_crawler():
    ''' This function will crawl through an entire category, regardless how many pages it consists of. '''
    pager = 1
    while pager < 11:
        theurl = "http://www.worldofquotes.com/topic/Nature/" + str(pager) + "/index.html"
        thepage = urllib.request.urlopen(theurl)
        soup = BeautifulSoup(thepage, "html.parser")
        for link in soup.findAll('blockquote'):
            sanitized = link.find('p').text.strip()
            spantext = link.find('a')
            writer = spantext.find('span').text
            print(sanitized)
            print(writer)
            print('---------------------------------------------------------')
        pager += 1

category_crawler()
So the question is: how do I replace the hardcoded number in the while loop with something that lets the script recognize on its own that it has passed the last page, and then quit automatically?
Upvotes: 1
Views: 1888
Reputation: 473763
The idea is to use an endless loop and break out of it once the "arrow right" element is no longer on the page, which means you have reached the last page. Simple and quite logical:
import requests
from bs4 import BeautifulSoup

page = 1
url = "http://www.worldofquotes.com/topic/Nature/{page}/index.html"

with requests.Session() as session:
    while True:
        response = session.get(url.format(page=page))
        soup = BeautifulSoup(response.content, "html.parser")

        # TODO: parse the page and collect the results

        if soup.find(class_="icon-arrow-right") is None:
            break  # last page

        page += 1
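For completeness, the step marked TODO could reuse the selectors from the question's script. A small helper along these lines (the function name is only illustrative) would collect the quotes from one parsed page:

from bs4 import BeautifulSoup

def parse_quotes(soup):
    """Collect (quote, writer) pairs from one parsed topic page,
    using the same selectors as the script in the question."""
    results = []
    for quote in soup.find_all('blockquote'):
        text = quote.find('p').text.strip()
        writer = quote.find('a').find('span').text
        results.append((text, writer))
    return results

Calling parse_quotes(soup) inside the while loop and extending a list with its results would accumulate every quote in the category.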
Upvotes: 2
Reputation: 1667
Here is my attempt.
Minor issue: put a try-except block in the code in case the redirection leads you somewhere that doesn't exist.
Now, the main issue: how to avoid parsing stuff you have already parsed. Keep a record of the URLs you have parsed, then check whether the actual URL the page was read from (obtained with the geturl() method of thepage) has already been seen. This worked on my Mac OS X machine.
Note: there are 10 pages in total according to what I see on the website, and this method does not require prior knowledge of the page's HTML, so it works in general.
import urllib
import urllib.request
from bs4 import BeautifulSoup

def category_crawler():
    ''' This function will crawl through an entire category, regardless how many pages it consists of. '''
    urlarchive = []
    pager = 1
    while True:
        theurl = "http://www.worldofquotes.com/topic/Nature/" + str(pager) + "/index.html"
        thepage = None
        try:
            thepage = urllib.request.urlopen(theurl)
            if thepage.geturl() in urlarchive:
                break
            else:
                urlarchive.append(thepage.geturl())
                print(pager)
        except:
            break
        soup = BeautifulSoup(thepage, "html.parser")
        for link in soup.findAll('blockquote'):
            sanitized = link.find('p').text.strip()
            spantext = link.find('a')
            writer = spantext.find('span').text
            print(sanitized)
            print(writer)
            print('---------------------------------------------------------')
        pager += 1

category_crawler()
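For illustration, the redirect behaviour this answer relies on can be checked in isolation. Assuming the site still redirects out-of-range page numbers (the page number 999 below is just a made-up example), comparing geturl() with the requested URL shows whether a redirect happened:

import urllib.request

requested = "http://www.worldofquotes.com/topic/Nature/999/index.html"  # assumed out-of-range page
thepage = urllib.request.urlopen(requested)
# geturl() returns the URL after any redirects; if it differs from the
# requested one, the site bounced us to a page we may have seen already.
print(thepage.geturl() != requested)

A set instead of a list for urlarchive would also make the membership test O(1), although for ten pages a list is perfectly fine.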
Upvotes: 0
Reputation: 5339
Try requests (with redirects disabled) and check whether you are still getting new quotes.
import requests
from bs4 import BeautifulSoup

def category_crawler():
    ''' This function will crawl through an entire category, regardless how many pages it consists of. '''
    pager = 1
    while pager < 11:
        theurl = "http://www.worldofquotes.com/topic/Art/" + str(pager) + "/index.html"
        thepage = requests.get(theurl, allow_redirects=False).text
        soup = BeautifulSoup(thepage, "html.parser")
        quotes = soup.find_all('blockquote')
        if not quotes:
            break  # a redirect past the last page returns a body with no quotes
        for link in quotes:
            sanitized = link.find('p').text.strip()
            spantext = link.find('a')
            writer = spantext.find('span').text
            print(sanitized)
            print(writer)
            print('---------------------------------------------------------')
        pager += 1

category_crawler()
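If the site really answers out-of-range page numbers with a redirect, the status code alone could serve as the stop signal, without inspecting the HTML at all. A minimal sketch, assuming that behaviour (the helper name is just for illustration):

import requests

def past_last_page(url):
    """Return True if the URL does not resolve to a normal page,
    i.e. the site answered with a redirect or an error instead of 200."""
    response = requests.get(url, allow_redirects=False)
    return response.status_code != 200

The while loop above could then break as soon as past_last_page(theurl) returns True, instead of relying on the hardcoded upper bound.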
Upvotes: 0