YASPLS

Reputation: 71

Scraping multiple pages WITHIN multiple pages (Scrapy)

I'm struggling to get my head around the structure of the code I need to set up to scrape multiple pages within multiple pages. Here is what I mean:

  1. I start on the main page that has URLs for all the alphabet letters. Each letter is the starting letter of dog breed names.
  2. For each letter, there are multiple pages of dog breeds. I need to go into every dog breed page.
  3. For each dog breed there are multiple pages of dogs listed for sale. I need to pull the data from each sale listing page.
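
The flow above can be modelled without Scrapy at all: each parse method is a generator that yields either new "requests" (a URL plus the callback that should handle it) or final items, and an engine keeps pulling work off a queue. A simplified, Scrapy-free sketch of that idea (all URLs and page contents here are made up for illustration):

```python
# Simplified model of how Scrapy drives chained callbacks.
# A "request" is just (url, callback); the engine pops requests off a
# queue, looks up the fake page's links, and feeds them to the callback,
# which may yield further requests or final item dicts.
from collections import deque

# Fake site: letter page -> breed pages -> listing pages (made-up data)
FAKE_SITE = {
    "/letters": ["/breeds/a", "/breeds/b"],
    "/breeds/a": ["/dog/affenpinscher", "/dog/akita"],
    "/breeds/b": ["/dog/beagle"],
}

def parse(url, links):
    # Level 1: one request per letter page
    for link in links:
        yield ("request", link, parse_breeds)

def parse_breeds(url, links):
    # Level 2: one request per breed listing page
    for link in links:
        yield ("request", link, parse_listing)

def parse_listing(url, links):
    # Level 3: no further requests; just yield the scraped item
    yield ("item", {"listing_url": url})

def run_engine(start_url):
    items = []
    queue = deque([(start_url, parse)])
    while queue:
        url, callback = queue.popleft()
        links = FAKE_SITE.get(url, [])
        for kind, payload, *rest in callback(url, links):
            if kind == "request":
                queue.append((payload, rest[0]))
            else:
                items.append(payload)
    return items
```

Running `run_engine("/letters")` produces one item per listing page. The key point this models: the deepest callback does not "call back" to anything; it simply yields the item and the chain ends.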

As stated before, I'm struggling to understand what the structure of my code needs to look like. Part of the issue is that I don't fully understand how the Python code flow works. Would something like this be correct:

def parse:
    Get the URLs of all the alphabet letters
    Pass each URL on to parse_A

def parse_A:
    Get the URLs of all pages for that alphabet letter
    Pass each URL on to parse_B

def parse_B:
    Get the URLs of all breeds listed on that page of that alphabet letter
    Pass each URL on to parse_C

def parse_C:
    Get the URLs of all the pages of dogs listed for that specific breed
    Pass each URL on to parse_D

def parse_D:
    Get the URL of each for-sale listing of that dog breed on that page
    Pass each URL on to parse_E

def parse_E:
    Get all of the details for that specific listing
    Callback to ??

For the final callback in parse_E, do I direct the callback to parse_D or to the very first parse?

Thank you!

Upvotes: 0

Views: 157

Answers (1)

Arun Augustine

Reputation: 1766

You have to follow a structure like the one below with Scrapy.

def parse(self, response):
    """
    Get the URLs of all the alphabet-letter pages (breed_urls)
    :param response:
    :return:
    """
    breed_urls = 'parse the urls from response'
    for url in breed_urls:
        yield scrapy.Request(url=url, callback=self.parse_sub_urls)


def parse_sub_urls(self, response):
    """
    Get the URLs of all sub-pages from the letter page (sub_urls)
    :param response:
    :return:
    """
    sub_urls = 'parse the urls from response'
    for url in sub_urls:
        yield scrapy.Request(url=url, callback=self.parse_details)

    # pagination at this level: re-enter the same callback
    next_page = 'parse the next-page url'
    if next_page:
        yield scrapy.Request(url=next_page, callback=self.parse_sub_urls)

def parse_details(self, response):
    """
    Get the final details from the listing page
    :param response:
    :return:
    """

    details = {}
    name = 'parse the name'
    details['name'] = name

    # parse all other details and append to the dictionary

    yield details

Upvotes: 3
