Venkateshwaran Selvaraj

Reputation: 1785

Loop URL to scrape using beautiful soup python

I am using the following code to scrape the website. It works fine for a single page; now I want to scrape several such pages, so I am looping over the URL as shown below.

from bs4 import BeautifulSoup
import urllib2
import csv
import re
number = 2500
for i in xrange(2500,7000):
    page = urllib2.urlopen("http://bvet.bytix.com/plus/trainer/default.aspx?id={}".format(i))
    soup = BeautifulSoup(page.read())
    for eachuniversity in soup.findAll('fieldset',{'id':'ctl00_step2'}):
        print re.sub(r'\s+',' ',','.join(eachuniversity.findAll(text=True)).encode('utf-8'))
        print '\n'
        number = number + 1

The following is the original code without the loop:

from bs4 import BeautifulSoup
import urllib2
import csv
import re
page = urllib2.urlopen("http://bvet.bytix.com/plus/trainer/default.aspx?id=4591")
soup = BeautifulSoup(page.read())
for eachuniversity in soup.findAll('fieldset',{'id':'ctl00_step2'}):
    print re.sub(r'\s+',' ',''.join(eachuniversity.findAll(text=True)).encode('utf-8'))

I am looping the id value in the URL from 2500 to 7000, but many ids have no data, so those pages don't exist. How do I skip those pages and scrape data only when there is data for the given id?

Upvotes: 1

Views: 3105

Answers (2)

astreal

Reputation: 3513

You can either try/except the fetch (EAFP style — see https://stackoverflow.com/questions/6092992/why-is-it-easier-to-ask-forgiveness-than-permission-in-python-but-not-in-java):

for i in xrange(2500,7000):
    try:
        page = urllib2.urlopen("http://bvet.bytix.com/plus/trainer/default.aspx?id={}".format(i))
    except urllib2.URLError:
        # id does not exist (404) or the request failed -- skip it
        continue
    else:
        soup = BeautifulSoup(page.read())
        for eachuniversity in soup.findAll('fieldset',{'id':'ctl00_step2'}):
            print re.sub(r'\s+',' ',','.join(eachuniversity.findAll(text=True)).encode('utf-8'))
            print '\n'
            number = number + 1

or use a (great) lib such as requests and check the status code before scraping:

import requests
for i in xrange(2500,7000):
    page = requests.get("http://bvet.bytix.com/plus/trainer/default.aspx?id={}".format(i))
    if not page.ok:
        # non-2xx response -- no page for this id, skip it
        continue
    soup = BeautifulSoup(page.text)
    for eachuniversity in soup.findAll('fieldset',{'id':'ctl00_step2'}):
        print re.sub(r'\s+',' ',','.join(eachuniversity.findAll(text=True)).encode('utf-8'))
        print '\n'
        number = number + 1

Basically, there is no way for you to know whether the page with that id exists before requesting the URL.
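One caveat worth noting: a server can also return a 200 page that simply contains no data, which a status-code check alone will miss. A belt-and-braces approach is to confirm that the data fieldset (`id="ctl00_step2"`, as in the question's code) is actually present before processing the page. A minimal sketch, written for Python 3 using only the stdlib `html.parser` so it needs no third-party install (the answers above use BeautifulSoup under Python 2):

```python
from html.parser import HTMLParser

class FieldsetFinder(HTMLParser):
    """Flags whether the page contains the fieldset that holds the data."""
    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'fieldset' and ('id', 'ctl00_step2') in attrs:
            self.found = True

def has_data(html):
    """Return True if the data fieldset is present in the fetched HTML."""
    finder = FieldsetFinder()
    finder.feed(html)
    return finder.found
```

In the loop you would fetch the page, call `has_data(page.text)`, and `continue` when it returns False, regardless of the HTTP status.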

Upvotes: 2

Guy Gavriely

Reputation: 11396

Try to find an index page on that site; otherwise you simply can't tell before trying to reach the URL.

Upvotes: 0
