Reputation: 3599
I am trying to crawl a set of pages and extract a phone number from each page I visit. I have read the BeautifulSoup documentation, but I still need to know how to follow the links and pull information out of the pages they lead to. Any suggestions?
Here is the code:
Main.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import urllib
from bs4 import BeautifulSoup

glimit = 100

def my_spider(max_pages):
    page = 2
    while page <= max_pages:
        url = 'http://www.bbb.org/search/?type=name&input=constrution&location=Austin%2c+TX&filter=combined&accredited=&radius=5000&country=USA&language=en&codeType=YPPA'
        url_2 = url + '&page='+ str(page) +'&source=bbbse'
        source_code = requests.get(url_2)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html5lib")
        limit = glimit
        li = soup.find('h4', {'class': 'hcolor'})
        children = li.find_all("a")
        for result in children:
            href = "http://www.bbb.org" + result.get('href')
            owl = (result.string)
            print owl
            get_single_item_data(href)
        page += 1

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html5lib")
    limit = glimit
    mysoup = soup.findAll('h3',{'class': 'address__heading' })[:limit]
    mysoup2 = mysoup.find_all("a")
    for item in mysoup2:
        href = "http://www.bbb.org" + item.get('href')
        print (item.string)

my_spider(2)
And here is the error:
Traceback (most recent call last):
  File "main.py", line 44, in <module>
    my_spider(2)
  File "main.py", line 27, in my_spider
    get_single_item_data(href)
  File "main.py", line 33, in get_single_item_data
    source_code = requests.get(item_url)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 55, in get
    return request('get', url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 421, in request
    prep = self.prepare_request(req)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 359, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 287, in prepare
    self.prepare_url(url, params)
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 334, in prepare_url
    scheme, auth, host, port, path, query, fragment = parse_url(url)
  File "/usr/lib/python2.7/dist-packages/urllib3/util.py", line 390, in parse_url
    raise LocationParseError("Failed to parse: %s" % url)
urllib3.exceptions.LocationParseError: Failed to parse: Failed to parse: www.bbb.orghttp:
Upvotes: 0
Views: 850
Reputation: 8392
You have various issues in your code.
1) You don't need to have href = "http://www.bbb.org" + in front of the link. Remove "http://www.bbb.org", as the links already have the host there.
2)
mysoup = soup.findAll('h3',{'class': 'address__heading' })[:limit]
mysoup2 = mysoup.find_all("a")
You are trying to find a tags in a list. You'll have to iterate over mysoup, or use find instead of findAll.
I've updated your code. Find it here.
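To illustrate both points, here is a minimal sketch rather than a drop-in replacement. It assumes Python 3 and uses the built-in "html.parser" instead of "html5lib" so it runs without extra parsers. The extract_links helper, BASE_URL, and SAMPLE_HTML are names and markup I made up for illustration; the real page structure may differ. It loops over the result of find_all instead of calling find_all on the list, and it uses urljoin so a link that already contains the host is left alone.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.bbb.org"  # host taken from the question


def extract_links(html, base_url=BASE_URL, limit=100):
    """Return (text, absolute_url) pairs for links inside the address headings."""
    # Assumption: "html.parser" (stdlib) is good enough here; the question used "html5lib".
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # find_all returns a list-like ResultSet, so iterate over it
    # instead of calling find_all on the result itself.
    for heading in soup.find_all("h3", {"class": "address__heading"})[:limit]:
        for anchor in heading.find_all("a", href=True):
            # urljoin keeps absolute hrefs unchanged and resolves relative ones.
            url = urljoin(base_url, anchor["href"])
            links.append((anchor.get_text(strip=True), url))
    return links


# Hypothetical markup for illustration only.
SAMPLE_HTML = """
<h3 class="address__heading"><a href="http://www.bbb.org/business/one">One</a></h3>
<h3 class="address__heading"><a href="/business/two">Two</a></h3>
"""

if __name__ == "__main__":
    for text, url in extract_links(SAMPLE_HTML):
        print(text, url)
```

In a crawler you would pass requests.get(item_url).text to extract_links in place of the sample string.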
Upvotes: 1