Reputation: 1785
I am using the following code to scrape a website. The code works fine for a single page. Now I want to scrape several such pages, so I am looping over the URL as shown below.
from bs4 import BeautifulSoup
import urllib2
import csv
import re
number = 2500
for i in xrange(2500, 7000):
    page = urllib2.urlopen("http://bvet.bytix.com/plus/trainer/default.aspx?id={}".format(i))
    soup = BeautifulSoup(page.read())
    for eachuniversity in soup.findAll('fieldset', {'id': 'ctl00_step2'}):
        print re.sub(r'\s+', ' ', ','.join(eachuniversity.findAll(text=True)).encode('utf-8'))
        print '\n'
    number = number + 1
The following is the normal code, without the loop:
from bs4 import BeautifulSoup
import urllib2
import csv
import re
page = urllib2.urlopen("http://bvet.bytix.com/plus/trainer/default.aspx?id=4591")
soup = BeautifulSoup(page.read())
for eachuniversity in soup.findAll('fieldset', {'id': 'ctl00_step2'}):
    print re.sub(r'\s+', ' ', ''.join(eachuniversity.findAll(text=True)).encode('utf-8'))
I am looping the id value in the URL from 2500 to 7000, but there are many ids for which no page exists. How do I skip those pages and scrape data only when data exists for a given id?
Upvotes: 1
Views: 3105
Reputation: 3513
You can either try/except the call (see https://stackoverflow.com/questions/6092992/why-is-it-easier-to-ask-forgiveness-than-permission-in-python-but-not-in-java):
for i in xrange(2500, 7000):
    try:
        page = urllib2.urlopen("http://bvet.bytix.com/plus/trainer/default.aspx?id={}".format(i))
    except urllib2.HTTPError:
        # the server returned an error (e.g. 404 for a missing id): skip it
        continue
    else:
        soup = BeautifulSoup(page.read())
        for eachuniversity in soup.findAll('fieldset', {'id': 'ctl00_step2'}):
            print re.sub(r'\s+', ' ', ','.join(eachuniversity.findAll(text=True)).encode('utf-8'))
            print '\n'
        number = number + 1
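The EAFP pattern above can be illustrated without touching the network. In this minimal Python 3 sketch, `fetch` and `PAGES` are hypothetical stand-ins for `urllib2.urlopen` and the remote site: `fetch` raises for ids with no page, just as `urlopen` raises `HTTPError` for a 404.

```python
# Hypothetical stand-in for the remote site: only some ids have pages.
PAGES = {2500: "<html>trainer 2500</html>", 2502: "<html>trainer 2502</html>"}

def fetch(page_id):
    """Stand-in for urllib2.urlopen: raises IOError for a missing id."""
    try:
        return PAGES[page_id]
    except KeyError:
        raise IOError("no page for id {}".format(page_id))

scraped = []
for i in range(2500, 2505):
    try:
        body = fetch(i)
    except IOError:
        continue  # EAFP: skip ids that do not exist
    else:
        scraped.append(body)

print(len(scraped))  # prints 2 -- only the ids that actually exist
```

The skip logic is identical to the answer's loop: ask forgiveness by attempting the fetch, and `continue` on failure rather than checking existence up front.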
or use a (great) lib such as requests and check the status before scraping:
import re
import requests
from bs4 import BeautifulSoup

for i in xrange(2500, 7000):
    page = requests.get("http://bvet.bytix.com/plus/trainer/default.aspx?id={}".format(i))
    if not page.ok:
        # non-2xx status code: the page for this id does not exist
        continue
    soup = BeautifulSoup(page.text)
    for eachuniversity in soup.findAll('fieldset', {'id': 'ctl00_step2'}):
        print re.sub(r'\s+', ' ', ','.join(eachuniversity.findAll(text=True)).encode('utf-8'))
        print '\n'
    number = number + 1
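The extraction step inside both loops can be checked offline. This Python 3 sketch runs the same `findAll`/`re.sub` logic against an inline HTML snippet standing in for one trainer page; the `fieldset` id matches the one targeted in the question, but the field names and values are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Inline HTML standing in for one trainer page (contents are invented).
html = """
<html><body>
  <fieldset id="ctl00_step2">
    <label>Name:</label> <span>Jane   Doe</span>
    <label>City:</label> <span>Bern</span>
  </fieldset>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for fieldset in soup.find_all("fieldset", {"id": "ctl00_step2"}):
    # Join all text nodes, then collapse runs of whitespace, as in the question.
    row = re.sub(r"\s+", " ", ",".join(fieldset.find_all(text=True))).strip()
    rows.append(row)

print(rows)
```

Note that joining every text node also picks up the whitespace between tags, which is why the `re.sub(r'\s+', ' ', ...)` cleanup is needed at all.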
Basically there's no way for you to know whether the page with that id exists before requesting the URL.
Upvotes: 2
Reputation: 11396
Try to find an index page on that site; otherwise, you simply can't tell before trying to reach the URL.
Upvotes: 0