Reputation: 336
I am creating a crawler that collects product info and product reviews from a specific category and exports them to CSV files. For example, I need all the information from a pants category, so my crawling starts there.
I can easily extract each product link from the category page. But then I need the crawler to open each link and fetch all the required information for that product. I also need it to fetch all the reviews for the product, and the problem is that the reviews are paginated too.
I start from here:
import scrapy

class SheinSpider(scrapy.Spider):
    name = "shein_spider"
    start_urls = [
        "https://www.shein.com/Men-Pants-c-1978.html?icn=men-pants&ici=www_tab02navbar02menu01dir06&scici=navbar_3~~tab02navbar02menu01dir06~~2_1_6~~real_1978~~~~0~~0"
    ]

    def parse(self, response):
        for item in response.css('.js-good'):
            yield {"product_url": item.css('.category-good-name a::attr(href)').get()}
I know how to parse the info from the catalog list, but I don't know how to make the crawler follow each link in the list.
Upvotes: 1
Views: 140
Reputation: 736
The way to follow links in Scrapy is to yield a scrapy.Request object with the URL and the parse method (callback) you want to use to process that page. From the Scrapy documentation tutorial: "Scrapy's mechanism of following links: when you yield a Request in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes."
I would recommend checking the tutorial in Scrapy documentation here, especially the section called "Following links".
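One detail worth highlighting: the hrefs you extract from a page are often relative, and scrapy.Request needs an absolute URL. Inside a spider, response.urljoin resolves a relative href against the current page URL; it is a thin wrapper over the standard library's urljoin, so you can see the behavior without Scrapy (the product path below is made up for illustration):

```python
from urllib.parse import urljoin

# The catalog page the spider is parsing (from the question's start_urls).
page_url = "https://www.shein.com/Men-Pants-c-1978.html"

# A relative href as it might come out of item.css('...::attr(href)').get().
# (hypothetical product path, for illustration only)
relative_href = "/Slim-Fit-Pants-p-123456.html"

# response.urljoin(relative_href) in a spider does exactly this:
absolute_url = urljoin(page_url, relative_href)
print(absolute_url)  # -> https://www.shein.com/Slim-Fit-Pants-p-123456.html
```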
In your specific example, this is the code that will make it work. Be mindful that the product URL passed to scrapy.Request needs to be absolute; the href you extract may well be a relative URL.
import scrapy

class SheinSpider(scrapy.Spider):
    name = "shein_spider"
    start_urls = [
        "https://www.shein.com/Men-Pants-c-1978.html?icn=men-pants&ici=www_tab02navbar02menu01dir06&scici=navbar_3~~tab02navbar02menu01dir06~~2_1_6~~real_1978~~~~0~~0"
    ]

    def parse(self, response):
        for item in response.css('.js-good'):
            product_url = item.css('.category-good-name a::attr(href)').get()
            # urljoin makes the URL absolute in case the href is relative
            yield scrapy.Request(response.urljoin(product_url), callback=self.parse_item)

    def parse_item(self, response):
        # Do what you want to do to process the product details page
        pass
Upvotes: 1