Reputation: 71
I'm struggling to get my head around the structure of the code I need to set up to scrape multiple pages within multiple pages. Here is what I mean:
As stated before, I'm struggling to understand what the structure of my code needs to look like. Part of the issue is that I don't fully understand how the Python code flow works here. Would something like this be correct?
def parse
    Get URL of all the alphabet letters
    pass on the URL to parse_A
def parse_A
    Get URL of all pages for that alphabet letter
    pass on the URL to parse_B
def parse_B
    Get URL for all breeds listed on that page of that alphabet letter
    pass on the URL to parse_C
def parse_C
    Get URL for all the pages of dogs listed of that specific breed
    pass on the URL to parse_D
def parse_D
    Get URL of specific for sale listing of that dog breed on that page
    pass on the URL to parse_E
def parse_E
    Get all of the details for that specific listing
    Callback to ??
For the final callback in parse_E, do I direct the callback to parse_D or to the very first parse?
Thank you!
Upvotes: 0
Views: 157
Reputation: 1766
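In Scrapy, each callback yields new requests that name the next callback, and the last callback simply yields the scraped item, so it does not need to call back to anything. Structure the spider like this: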
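import scrapy


class BreedSpider(scrapy.Spider):
    # The spider name, start URL, and every CSS selector below are placeholders;
    # replace them with values that match the real site.
    name = "breeds"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        """Get the URL of each alphabet letter page (breed_urls)."""
        breed_urls = response.css("a.letter::attr(href)").getall()
        for url in breed_urls:
            # urljoin turns a relative href into an absolute URL
            yield scrapy.Request(url=response.urljoin(url), callback=self.parse_sub_urls)

    def parse_sub_urls(self, response):
        """Get the URL of each sub-page (sub_urls), then follow pagination."""
        sub_urls = response.css("a.listing::attr(href)").getall()
        for url in sub_urls:
            yield scrapy.Request(url=response.urljoin(url), callback=self.parse_details)
        # The next page of the same list is handled by this same callback
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse_sub_urls)

    def parse_details(self, response):
        """Get the final details from the listing page."""
        details = {}
        details["name"] = response.css("h1::text").get()
        # parse all other details and add them to the dictionary
        # Yield the item itself; no further callback is needed
        yield details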
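To make the code flow more concrete, here is a rough, Scrapy-free sketch of what happens to the things those callbacks yield. It is only a toy model under stated assumptions: `FAKE_SITE`, the `Request` dataclass, and `crawl` are all made up for illustration and are not Scrapy's API. The point is that a callback never calls the next callback directly; it yields requests, the engine queues them, and yielded dictionaries are collected as items.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for a website: URL -> page content.
FAKE_SITE = {
    "/": {"links": ["/a", "/b"]},
    "/a": {"links": ["/a/beagle"], "next": "/a?page=2"},
    "/a?page=2": {"links": ["/a/akita"]},
    "/b": {"links": ["/b/boxer"]},
    "/a/beagle": {"name": "Beagle"},
    "/a/akita": {"name": "Akita"},
    "/b/boxer": {"name": "Boxer"},
}


@dataclass
class Request:
    """Toy request: a URL plus the function that should handle its page."""
    url: str
    callback: Callable


def parse(page):
    # Top level: one request per letter page.
    for url in page["links"]:
        yield Request(url, parse_sub_urls)


def parse_sub_urls(page):
    # One request per listing, then one more for the next page (if any).
    for url in page["links"]:
        yield Request(url, parse_details)
    next_page = page.get("next")
    if next_page:
        yield Request(next_page, parse_sub_urls)


def parse_details(page):
    # Last level: yield the item itself, not another request.
    yield {"name": page["name"]}


def crawl(start_url):
    """Toy engine loop: fetch a page, run its callback, sort what it yields."""
    queue = deque([Request(start_url, parse)])
    items = []
    while queue:
        request = queue.popleft()
        page = FAKE_SITE[request.url]
        for result in request.callback(page):
            if isinstance(result, Request):
                queue.append(result)   # more pages to visit
            else:
                items.append(result)   # a finished item
    return items


if __name__ == "__main__":
    for item in crawl("/"):
        print(item)
```

The crawl ends when the queue is empty, which is why the last callback has nothing to "call back" to. Real Scrapy downloads pages asynchronously and does not guarantee the order in which items arrive, so treat the ordering of this toy loop as incidental.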
Upvotes: 3