Reputation: 1785
I am trying to scrape data from a website using Beautiful Soup 4 and Python. Here is my code:
from bs4 import BeautifulSoup
import urllib2
i = 0
for i in xrange(0,38):
    page=urllib2.urlopen("http://www.sfap.org/klsfaprep_search?page={}&type=1&strname=&loc=&op=Lancer%20la%20recherche&form_build_id=form-72a297de309517ed5a2c28af7ed15208&form_id=klsfaprep_search_form" %i)
    soup = BeautifulSoup(page.read())
    for eachuniversity in soup.findAll('div',{'class':'field-item odd'}):
        print ''.join(eachuniversity.findAll(text=True)).encode('utf-8')
        print ',\n'
    i= i+ 1
I think the problem is in the URL I have given and in the increment statement. I am able to scrape page by page; the problem appears only when I use xrange.
Upvotes: 0
Views: 3217
Reputation: 369124
The ValueError occurs because you're mixing {} formatting with % formatting.
>>> '{}%20la' % 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unsupported format character 'a' (0x61) at index 6
>>> '{}%20la'.format(1)
'1%20la'
I recommend using {} formatting (str.format), because the URL contains multiple literal % characters, which the % operator would try to interpret as format specifiers.
page=urllib2.urlopen("http://www.sfap.org/klsfaprep_search?page={}&type=1&strname=&loc=&op=Lancer%20la%20recherche&form_build_id=form-72a297de309517ed5a2c28af7ed15208&form_id=klsfaprep_search_form".format(i))
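To illustrate why {} formatting is the easier choice here, the following minimal sketch builds the same URL both ways. The shortened example.com URL is a stand-in, not the real site, and the variable names are only for illustration. With the % operator, every literal percent sign in the URL would have to be doubled.

```python
# Hypothetical, shortened URL; example.com stands in for the real site.
template = "http://example.com/search?page={}&op=Lancer%20la%20recherche"

# str.format only interprets braces, so the percent-encoded "%20" is left alone.
url_format = template.format(3)

# With the % operator, each literal percent sign must be escaped as "%%".
percent_template = "http://example.com/search?page=%d&op=Lancer%%20la%%20recherche"
url_percent = percent_template % 3

# Both approaches produce the same URL.
assert url_format == url_percent
print(url_format)
```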
You don't need i = 0 and i = i + 1, because for i in xrange(0,38) already takes care of initializing and advancing i.
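As a small illustration of that point, the sketch below shows that the for statement rebinds the loop variable on every iteration, so changing it inside the body has no effect on the next iteration. It uses range so that it runs on Python 3; xrange behaves the same way in Python 2. The list name seen is only for illustration.

```python
seen = []
for i in range(0, 5):
    seen.append(i)
    # Reassigning i here is discarded: the loop rebinds i to the next
    # value from range() at the start of the following iteration.
    i = i + 10

print(seen)
```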
import urllib2  # Import standard library modules first (PEP 8).
from bs4 import BeautifulSoup
for i in xrange(0,38):
    page = urllib2.urlopen("http://www.sfap.org/klsfaprep_search?page={}&type=1&strname=&loc=&op=Lancer%20la%20recherche&form_build_id=form-72a297de309517ed5a2c28af7ed15208&form_id=klsfaprep_search_form".format(i))
    soup = BeautifulSoup(page.read())
    for eachuniversity in soup.findAll('div',{'class':'field-item odd'}):
        print ''.join(eachuniversity.findAll(text=True)).encode('utf-8')
        print ',\n'
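An alternative that avoids hand-written percent-encoding altogether is to let the standard library build the query string. This is only a sketch, written for Python 3 (where urlencode lives in urllib.parse and accepts quote_via); the helper name build_search_url is hypothetical, and the parameter values are copied from the URL above. It constructs the URL without making any network request.

```python
from urllib.parse import urlencode, quote

BASE_URL = "http://www.sfap.org/klsfaprep_search"

def build_search_url(page_number):
    """Return the search URL for one result page (hypothetical helper)."""
    params = [
        ("page", page_number),
        ("type", 1),
        ("strname", ""),
        ("loc", ""),
        ("op", "Lancer la recherche"),
        ("form_build_id", "form-72a297de309517ed5a2c28af7ed15208"),
        ("form_id", "klsfaprep_search_form"),
    ]
    # quote_via=quote encodes spaces as %20 rather than the default "+".
    return BASE_URL + "?" + urlencode(params, quote_via=quote)

print(build_search_url(0))
```

Because the encoding is done by urlencode, the string never contains a hand-typed % that a formatting operator could misread.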
Upvotes: 2