BARNOWL
BARNOWL

Reputation: 3589

scrapy spider not following link to another page

I'm referencing from this tutorial and it works for getting data on the first page, and the following a link afterwards.

However, in my example, I am trying to check if listing has 3 things before I click on the listing link:

  1. item must have business name
  2. item must have phone number
  3. item must have a website

If so, I want scrapy to click on the business link that goes to the business profile where I am able to retrieve the email.

After that, I want scrapy to go back to the main page and repeat the process for the rest of 19 listings on that page.

Yet, it outputs a list of duplicates like this:

enter image description here

service_name = input("Input Industry: ")
city = input("Input The City: ")


class Item(scrapy.Item):    
    business_name = scrapy.Field()
    phonenumber = scrapy.Field()
    email = scrapy.Field()
    website = scrapy.Field()

class Bbb_spider(scrapy.Spider):
    name = "bbb"

    start_urls = [
        "http://www.yellowbook.com/s/"+ service_name + "/" + city
    ]

    def __init__(self):
        self.seen_business_names = []
        self.seen_websites = []
        self.seen_emails = []


    def parse(self, response):
        for business in response.css('div.listing-info'):
            item = Item()
            item['business_name'] = business.css('div.info.l h2 a::text').extract()
            item['website'] = business.css('a.s_website::attr(href)').extract()
            for x in item['business_name'] and item['website']:
                if x not in self.seen_business_names and item['website']:
                    if item['business_name']:
                        if item['website']:
                            item['phonenumber'] = business.css('div.phone-number::text').extract_first()
                            for href in response.css('div.info.l h2 a::attr(href)'):
                                yield response.follow(href, self.businessprofile)

            for href in response.css('ul.page-nav.r li a::attr(href)'):
                yield response.follow(href, self.parse)

    def businessprofile(self, response):
        for profile in response.css('div.profile-info.l'):
            item = Item()
            item['email'] = profile.css('a.email::text').extract()
            for x in item['email']:
                if x not in self.seen_emails:
                    self.seen_business_names.append(x)
                    yield item
python scrapy

Any suggestions on how to improve the code?

Upvotes: 0

Views: 36

Answers (1)

Lore
Lore

Reputation: 1908

Read the guide thorougly before doing a spider. To populate items, you should use an Item Loader, that can have post and pre-processors useful for your purpose. For duplication, you can use a custom pipeline.

Upvotes: 1

Related Questions