Reputation: 3589
I'm working from this tutorial, and it works for getting data from the first page and then following a link afterwards.
However, in my example, I am trying to check whether a listing has 3 things before I click on the listing link:
If so, I want Scrapy to click on the business link that goes to the business profile, where I can retrieve the email.
After that, I want Scrapy to go back to the main page and repeat the process for the rest of the 19 listings on that page.
Yet, it outputs a list of duplicates like this:
import scrapy

service_name = input("Input Industry: ")
city = input("Input The City: ")


class Item(scrapy.Item):
    business_name = scrapy.Field()
    phonenumber = scrapy.Field()
    email = scrapy.Field()
    website = scrapy.Field()


class Bbb_spider(scrapy.Spider):
    name = "bbb"
    start_urls = [
        "http://www.yellowbook.com/s/" + service_name + "/" + city
    ]

    def __init__(self):
        self.seen_business_names = []
        self.seen_websites = []
        self.seen_emails = []

    def parse(self, response):
        for business in response.css('div.listing-info'):
            item = Item()
            item['business_name'] = business.css('div.info.l h2 a::text').extract()
            item['website'] = business.css('a.s_website::attr(href)').extract()
            for x in item['business_name'] and item['website']:
                if x not in self.seen_business_names and item['website']:
                    if item['business_name']:
                        if item['website']:
                            item['phonenumber'] = business.css('div.phone-number::text').extract_first()
                            for href in response.css('div.info.l h2 a::attr(href)'):
                                yield response.follow(href, self.businessprofile)
        for href in response.css('ul.page-nav.r li a::attr(href)'):
            yield response.follow(href, self.parse)

    def businessprofile(self, response):
        for profile in response.css('div.profile-info.l'):
            item = Item()
            item['email'] = profile.css('a.email::text').extract()
            for x in item['email']:
                if x not in self.seen_emails:
                    self.seen_business_names.append(x)
                    yield item
Any suggestions on how to improve the code?
Upvotes: 0
Views: 36
Reputation: 1908
Read the guide thoroughly before writing a spider. To populate items, you should use an Item Loader, which can have input and output processors useful for your purpose. To drop duplicates, you can use a custom item pipeline.
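As a rough illustration of the pipeline idea, here is a minimal sketch of a duplicate filter. It assumes each item carries `business_name` as a single string (for example from `extract_first()` rather than `extract()`), and the class name `DuplicateBusinessPipeline` is made up for this example. The `try`/`except` around the import is only there so the snippet can run without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch can be tried without Scrapy installed.
    class DropItem(Exception):
        pass


class DuplicateBusinessPipeline:
    """Drop any item whose business name was already seen in this crawl.

    Assumes item['business_name'] is a single string, not a list.
    """

    def __init__(self):
        # A set gives constant-time membership checks.
        self.seen_business_names = set()

    def process_item(self, item, spider):
        name = item.get("business_name")
        if name in self.seen_business_names:
            # Raising DropItem tells Scrapy to discard this item.
            raise DropItem(f"Duplicate business: {name!r}")
        self.seen_business_names.add(name)
        return item
```

To use something like this, the class would be registered in the project's `ITEM_PIPELINES` setting under its import path. With the pipeline handling duplicates, the `seen_*` lists in the spider should no longer be needed.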
Upvotes: 1