user2628641

Reputation: 2154

Crawl dynamic data using Scrapy

I am trying to get the product rating information from target.com. The URL for the product is:

http://www.target.com/p/bounty-select-a-size-paper-towels-white-8-huge-rolls/-/A-15258543#prodSlot=medium_1_4&term=bounty

After looking through response.body, I found that the rating information is not part of the static HTML, so I need to get it some other way. Similar questions suggest that, to get dynamically loaded data, I need to:

  1. find the correct XHR and the URL to send the request to
  2. use FormRequest to get the right JSON
  3. parse the JSON (if I am wrong about the steps, please tell me)

I am stuck at step 2 right now. I found that an XHR named 15258543 contains the rating distribution, but I don't know how to send a request to get the JSON: which URL to send it to and which parameters to use.

Can someone walk me through this? Thank you!

Upvotes: 1

Views: 258

Answers (1)

alecxe

Reputation: 473863

The trickiest part is getting the 15258543 product ID dynamically and then using it inside the URL to get the reviews. This product ID can be found in multiple places on the product page. For instance, there is a meta element that we can use:

<meta itemprop="productID" content="15258543">
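To see what that lookup does outside a Scrapy callback, here is a minimal sketch using only the standard library. The names ProductIdParser and extract_product_id are hypothetical helpers made up for this illustration, and the sample HTML is a cut-down stand-in for the real product page:

```python
from html.parser import HTMLParser


class ProductIdParser(HTMLParser):
    """Hypothetical helper: remembers the content of the first
    <meta itemprop="productID"> tag it sees."""

    def __init__(self):
        super().__init__()
        self.product_id = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything except the first matching meta element.
        if tag != "meta" or self.product_id is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("itemprop") == "productID":
            self.product_id = attr_map.get("content")


def extract_product_id(html):
    """Return the product ID as a string, or None if the tag is absent."""
    parser = ProductIdParser()
    parser.feed(html)
    return parser.product_id


# Assumed sample markup; the real page contains much more.
sample = '<html><head><meta itemprop="productID" content="15258543"></head></html>'
print(extract_product_id(sample))
```

Inside a spider, the XPath expression does the same job in one line; the point of the sketch is only that the ID comes straight from the static HTML, so no extra request is needed to find it.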

Here is a working spider that makes a separate GET request for the review statistics (a plain scrapy.Request, so no FormRequest is needed), loads the JSON response via json.loads() and prints the overall product rating:

import json

import scrapy

class TargetSpider(scrapy.Spider):
    name = "target"
    allowed_domains = ["target.com"]
    start_urls = ["http://www.target.com/p/bounty-select-a-size-paper-towels-white-8-huge-rolls/-/A-15258543#prodSlot=medium_1_4&term=bounty"]

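    def parse(self, response):
        # The product ID is exposed in a <meta itemprop="productID"> element.
        product_id = response.xpath("//meta[@itemprop='productID']/@content").extract_first()
        if product_id is None:
            self.logger.warning("No product ID found on %s", response.url)
            return

        # Request the review statistics for this product; pass the ID along in meta.
        return scrapy.Request("http://tws.target.com/productservice/services/reviews/v1/reviewstats/" + product_id,
                              callback=self.parse_ratings,
                              meta={"product_id": product_id})

    def parse_ratings(self, response):
        # The JSON is keyed by product ID under "result".
        data = json.loads(response.body)

        print(data["result"][response.meta["product_id"]]["coreStats"]["AverageOverallRating"])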

Prints 4.5585.
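If the response shape ever changes, the chained indexing in parse_ratings raises a KeyError. A more defensive lookup could look like the sketch below. The function name average_rating is hypothetical, and the sample payload is an assumption that mirrors only the keys used in the spider, not the full real response:

```python
import json


def average_rating(body, product_id):
    """Return AverageOverallRating for product_id, or None if any key is missing."""
    data = json.loads(body)
    stats = data.get("result", {}).get(product_id, {}).get("coreStats", {})
    return stats.get("AverageOverallRating")


# Assumed payload: only the keys the spider reads are reproduced here.
sample_body = json.dumps(
    {"result": {"15258543": {"coreStats": {"AverageOverallRating": 4.5585}}}}
)

print(average_rating(sample_body, "15258543"))
```

In the spider, the same function could be called as average_rating(response.body, response.meta["product_id"]).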

Upvotes: 2
