Scrapy: Scraping nested links

Question

I am new to Scrapy and web scraping. Please don't get mad. I am trying to scrape profilecanada.com. When I run the code below, no errors are reported, but I don't think it is scraping anything. My spider starts on a page that contains a list of links. Each link leads to a page with another list of links, and each of those links leads to a page containing the data I need to extract and save to a JSON file. In short, it is something like "nested link scraping"; I don't know what it is actually called. The image below shows the spider's output when I ran it. Thank you in advance for your help.

import scrapy

class ProfilecanadaSpider(scrapy.Spider):
    name = 'profilecanada'
    allowed_domains = ['http://www.profilecanada.com']
    start_urls = ['http://www.profilecanada.com/browse_by_category.cfm/']

    def parse(self, response):

      # category URLs found on the start page
      category_list_urls =  response.css('div.div_category_list > div.div_category_list_column > ul > li.li_category > a::attr(href)').extract()
      # start_u = 'http://www.profilecanada.com/browse_by_category.cfm/'

      # for each category of company
      for url in category_list_urls:
        url = url[3:]
        url = response.urljoin(url)
        return scrapy.Request(url=url, callback=self.profileCategoryPages)


    def profileCategoryPages(self, response):
      company_list_url = response.css('div.dv_en_block_name_frame > a::attr(href)').extract()

      # for each company in the list
      for url in company_list_url:
        url = response.urljoin(url)
        return  scrapy.Request(url=url, callback=self.companyDetails)

    def companyDetails(self, response):
      return {
        'company_name': response.css('span#name_frame::text').extract_first(),
        'street_address': str(response.css('span#frame_addr::text').extract_first()),
        'city': str(response.css('span#frame_city::text').extract_first()),
        'region_or_province': str(response.css('span#frame_province::text').extract_first()),
        'postal_code': str(response.css('span#frame_postal::text').extract_first()),
        'country': str(response.css('div.type6_GM > div > div::text')[-1].extract())[2:],
        'phone_number': str(response.css('span#frame_phone::text').extract_first()),
        'fax_number': str(response.css('span#frame_fax::text').extract_first()),
        'email': str(response.css('span#frame_email::text').extract_first()),
        'website': str(response.css('span#frame_website > a::attr(href)').extract_first()),
      }

[Image: the result in cmd when I ran the spider]

Answers (1)
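
Judging only from the code in the question (the console output is not reproduced here), two things look likely to stop the crawl early. First, each callback uses `return` inside its `for` loop, so the method exits on the first iteration and at most one request per page is ever scheduled; `yield` would emit one request per link. Second, `allowed_domains` holds a full URL, whereas Scrapy expects bare domain names there, so follow-up requests may be filtered as offsite. The sketch below imitates both effects in plain Python, without Scrapy. The hrefs, `parse_with_return`, `parse_with_yield`, and `host_allowed` are hypothetical names made up for illustration, and `host_allowed` is a simplified stand-in, not Scrapy's actual filtering code.

```python
from urllib.parse import urljoin, urlparse

BASE = "http://www.profilecanada.com/browse_by_category.cfm/"
# Hypothetical hrefs standing in for whatever the CSS selector returns.
HREFS = ["category_a.cfm", "category_b.cfm", "category_c.cfm"]


def parse_with_return(hrefs):
    """Mirrors the question's loops: `return` exits on the first iteration."""
    for href in hrefs:
        return urljoin(BASE, href)


def parse_with_yield(hrefs):
    """Generator version: every href produces a value."""
    for href in hrefs:
        yield urljoin(BASE, href)


def host_allowed(url, allowed_domains):
    """Simplified stand-in for an offsite check (an assumption, not Scrapy's code):
    the hostname must equal an allowed domain or be a subdomain of it."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


first_only = parse_with_return(HREFS)      # a single URL, the rest are lost
all_urls = list(parse_with_yield(HREFS))   # one URL per href

print(first_only)
print(len(all_urls))                                              # 3
print(host_allowed(all_urls[0], ["http://www.profilecanada.com"]))  # False
print(host_allowed(all_urls[0], ["profilecanada.com"]))             # True
```

If this diagnosis applies, the corresponding change in the spider would be to write `yield scrapy.Request(...)` in both loops and to set `allowed_domains = ['profilecanada.com']`.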