Crawl dynamic data using Scrapy

Question

I am trying to get the product rating information from target.com. The URL for the product is:

http://www.target.com/p/bounty-select-a-size-paper-towels-white-8-huge-rolls/-/A-15258543#prodSlot=medium_1_4&term=bounty

After looking through response.body, I found that the rating information is not part of the static HTML, so I need to get it some other way. Similar questions say that to get dynamically loaded data I need to:

1. Find the correct XHR and where to send the request.
2. Use FormRequest to get the right JSON.
3. Parse the JSON.

(If I am wrong about these steps, please tell me.)

I am stuck at step 2 right now. I found that an XHR named 15258543 contains the rating distribution, but I don't know how to send a request to get that JSON: which URL to request and which parameters to use.

Can someone walk me through this? Thank you!

alecxe · Accepted Answer

The trickiest part is getting the 15258543 product ID dynamically and then using it in the URL that returns the reviews. The product ID appears in multiple places on the product page; for instance, there is a meta element with itemprop="productID" whose content attribute holds the ID.
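To show the idea outside of Scrapy, here is a minimal sketch that reads the ID from such a meta element using only the standard library. The SAMPLE_HTML string is an assumption, reduced to the single element the spider's XPath relies on; the real page contains far more markup. ProductIdParser and extract_product_id are illustrative names, not part of any library.

```python
from html.parser import HTMLParser


class ProductIdParser(HTMLParser):
    """Collects the content of the first <meta itemprop="productID"> tag."""

    def __init__(self):
        super().__init__()
        self.product_id = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        attributes = dict(attrs)
        if (tag == "meta"
                and attributes.get("itemprop") == "productID"
                and self.product_id is None):
            self.product_id = attributes.get("content")


def extract_product_id(html):
    """Return the product ID found in html, or None if there is none."""
    parser = ProductIdParser()
    parser.feed(html)
    return parser.product_id


# Assumed markup: only the meta element that carries the product ID.
SAMPLE_HTML = '<html><head><meta itemprop="productID" content="15258543"></head></html>'

print(extract_product_id(SAMPLE_HTML))  # 15258543
```

Inside a spider the XPath expression does the same job in one line; the sketch only makes explicit which attribute is matched and which one is read.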

Here is a working spider that makes a separate GET request to get the reviews, loads the JSON response via json.loads() and prints the overall product rating:

import json

import scrapy

class TargetSpider(scrapy.Spider):
    name = "target"
    allowed_domains = ["target.com"]
    start_urls = ["http://www.target.com/p/bounty-select-a-size-paper-towels-white-8-huge-rolls/-/A-15258543#prodSlot=medium_1_4&term=bounty"]

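    def parse(self, response):
        # The product ID is exposed in a <meta itemprop="productID"> element.
        product_id = response.xpath("//meta[@itemprop='productID']/@content").extract_first()
        if product_id is None:
            # Without an ID the reviews URL cannot be built, so stop here.
            self.logger.warning("No product ID found on %s", response.url)
            return None

        # The review statistics come from a separate endpoint keyed by product ID.
        return scrapy.Request("http://tws.target.com/productservice/services/reviews/v1/reviewstats/" + product_id,
                              callback=self.parse_ratings,
                              meta={"product_id": product_id})

    def parse_ratings(self, response):
        # The endpoint returns JSON, so decode it into a dict.
        data = json.loads(response.body)

        # The stats are nested under the product ID passed along via meta.
        print(data["result"][response.meta["product_id"]]["coreStats"]["AverageOverallRating"])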

Prints 4.5585.
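If the response might lack some of those keys, the lookup can be wrapped in a small helper. The following is only a sketch: average_rating is a hypothetical helper name, and SAMPLE_BODY is an assumed payload that mirrors just the key path used in the spider, not the full response from the endpoint.

```python
import json


def average_rating(body, product_id):
    """Return AverageOverallRating for product_id, or None if it is missing."""
    data = json.loads(body)
    try:
        return data["result"][product_id]["coreStats"]["AverageOverallRating"]
    except (KeyError, TypeError):
        # A key is absent or an intermediate value is not a dict.
        return None


# Assumed payload: only the keys the spider reads, with the rating shown above.
SAMPLE_BODY = '{"result": {"15258543": {"coreStats": {"AverageOverallRating": 4.5585}}}}'

print(average_rating(SAMPLE_BODY, "15258543"))  # 4.5585
print(average_rating(SAMPLE_BODY, "0"))         # None
```

Returning None instead of raising lets the spider skip a product with no review statistics rather than fail in the callback.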
