Reputation: 1785
I am using the following code to scrape a website. The code works fine for a single page. Now I want to scrape several such pages, so I am looping over the URL as shown below.
from bs4 import BeautifulSoup
import urllib2
import csv
import re
number = 2500
for i in xrange(2500, 7000):
    page = urllib2.urlopen("http://bvet.bytix.com/plus/trainer/default.aspx?id={}".format(i))
    soup = BeautifulSoup(page.read())
    for eachuniversity in soup.findAll('fieldset', {'id': 'ctl00_step2'}):
        print re.sub(r'\s+', ' ', ','.join(eachuniversity.findAll(text=True)).encode('utf-8'))
        print '\n'
    number = number + 1
The following is the normal code, without the loop:
from bs4 import BeautifulSoup
import urllib2
import csv
import re
page = urllib2.urlopen("http://bvet.bytix.com/plus/trainer/default.aspx?id=4591")
soup = BeautifulSoup(page.read())
for eachuniversity in soup.findAll('fieldset', {'id': 'ctl00_step2'}):
    print re.sub(r'\s+', ' ', ''.join(eachuniversity.findAll(text=True)).encode('utf-8'))
I am looping the id value in the URL from 2500 to 7000, but there are many ids for which no page exists. How do I skip those pages and scrape data only when data exists for a given id?
Upvotes: 1
Views: 3105
Reputation: 3513
You can either try/except the call (see https://stackoverflow.com/questions/6092992/why-is-it-easier-to-ask-forgiveness-than-permission-in-python-but-not-in-java):
for i in xrange(2500, 7000):
    try:
        page = urllib2.urlopen("http://bvet.bytix.com/plus/trainer/default.aspx?id={}".format(i))
    except urllib2.HTTPError:
        # the server returned an error (e.g. 404 for a missing id): skip it
        continue
    else:
        soup = BeautifulSoup(page.read())
        for eachuniversity in soup.findAll('fieldset', {'id': 'ctl00_step2'}):
            print re.sub(r'\s+', ' ', ','.join(eachuniversity.findAll(text=True)).encode('utf-8'))
            print '\n'
        number = number + 1
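The EAFP pattern above can be illustrated without touching the network. In this minimal Python 3 sketch, `fetch` and `PAGES` are hypothetical stand-ins for `urllib2.urlopen` and the remote site: `fetch` raises for ids with no page, just as `urlopen` raises `HTTPError` for a 404.

```python
# Hypothetical stand-in for the remote site: only some ids have pages.
PAGES = {2500: "<html>trainer 2500</html>", 2502: "<html>trainer 2502</html>"}

def fetch(page_id):
    """Stand-in for urllib2.urlopen: raises IOError for a missing id."""
    try:
        return PAGES[page_id]
    except KeyError:
        raise IOError("no page for id {}".format(page_id))

scraped = []
for i in range(2500, 2505):
    try:
        body = fetch(i)
    except IOError:
        continue  # EAFP: skip ids that do not exist
    else:
        scraped.append(body)

print(len(scraped))  # prints 2 -- only the ids that actually exist
```

The skip logic is identical to the answer's loop: ask forgiveness by attempting the fetch, and `continue` on failure rather than checking existence up front.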
or use a (great) lib such as requests and check the status before scraping:
import re
import requests
from bs4 import BeautifulSoup

for i in xrange(2500, 7000):
    page = requests.get("http://bvet.bytix.com/plus/trainer/default.aspx?id={}".format(i))
    if not page.ok:
        # non-2xx status code: the page for this id does not exist
        continue
    soup = BeautifulSoup(page.text)
    for eachuniversity in soup.findAll('fieldset', {'id': 'ctl00_step2'}):
        print re.sub(r'\s+', ' ', ','.join(eachuniversity.findAll(text=True)).encode('utf-8'))
        print '\n'
    number = number + 1
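The extraction step inside both loops can be checked offline. This Python 3 sketch runs the same `findAll`/`re.sub` logic against an inline HTML snippet standing in for one trainer page; the `fieldset` id matches the one targeted in the question, but the field names and values are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Inline HTML standing in for one trainer page (contents are invented).
html = """
<html><body>
  <fieldset id="ctl00_step2">
    <label>Name:</label> <span>Jane   Doe</span>
    <label>City:</label> <span>Bern</span>
  </fieldset>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for fieldset in soup.find_all("fieldset", {"id": "ctl00_step2"}):
    # Join all text nodes, then collapse runs of whitespace, as in the question.
    row = re.sub(r"\s+", " ", ",".join(fieldset.find_all(text=True))).strip()
    rows.append(row)

print(rows)
```

Note that joining every text node also picks up the whitespace between tags, which is why the `re.sub(r'\s+', ' ', ...)` cleanup is needed at all.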
Basically there's no way for you to know whether the page with that id exists before requesting the URL.
Upvotes: 2
Reputation: 11396
Try to find an index page on that site; otherwise, you simply can't tell before trying to reach the URL.
Upvotes: 0