CarbeneHu

Reputation: 3

How to handle ajax data with scrapy

I'm writing a web spider with Scrapy and have run into a problem. I scrape a group of HTML elements, each of which contains an id that I need in order to send an AJAX request. However, when I try to combine the AJAX data with the other data I extracted from the HTML, it goes wrong. How can I solve this? Here's my code:

class DoubanSpider(scrapy.Spider):

    name = "douban"
    allowed_domains = ["movie.douban.com"]
    start_urls = ["https://movie.douban.com/review/best"]

    def parse(self, response):
        for review in response.css(".review-item"):
            rev = Review()
            rev['reviewer'] = review.css("a[property='v:reviewer']::text").extract_first()
            rev['rating'] = review.css("span[property='v:rating']::attr(class)").extract_first()
            rev['title'] = review.css(".main-bd>h2>a::text").extract_first()
            number = review.css("::attr(id)").extract_first()
            f = scrapy.Request(url='https://movie.douban.com/j/review/%s/full' % number,
                                     callback=self.parse_full_passage)
            rev['comment'] = f
            yield rev

    def parse_full_passage(self, response):
        r = json.loads(response.body_as_unicode())
        html = r['html']
        yield html

Upvotes: 0

Views: 1019

Answers (2)

gangabass

Reputation: 10666

Finish populating the item from the HTML first, then pass it to the JSON callback through the request's meta instead of yielding it right away:

yield scrapy.Request(url='https://movie.douban.com/j/review/%s/full' % number,
                     callback=self.parse_full_passage, meta={'rev': rev})

Then, in the JSON callback, read the item back from meta and yield it once it is complete:

def parse_full_passage(self, response):
    rev = response.meta["rev"]
    r = json.loads(response.text)  # body_as_unicode() is deprecated in newer Scrapy releases
    .....
    yield rev
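
As a rough sketch of what the elided part of the callback might do, assuming the JSON endpoint returns an object with an html key as in the question. The helper names full_review_url and attach_comment are hypothetical, and a plain dict stands in for the Review item so the snippet runs without Scrapy:

```python
import json

# URL pattern taken from the question; %s is replaced by the review id.
FULL_REVIEW_URL = "https://movie.douban.com/j/review/%s/full"


def full_review_url(review_id):
    """Build the AJAX URL for one review id (hypothetical helper)."""
    return FULL_REVIEW_URL % review_id


def attach_comment(rev, json_text):
    """Parse the JSON body and store its 'html' field on the partial item.

    `rev` is the item carried along in meta; `json_text` is the response text.
    """
    data = json.loads(json_text)
    rev["comment"] = data["html"]
    return rev


if __name__ == "__main__":
    # A plain dict stands in for the Review item; the JSON body is made up.
    partial = {"reviewer": "someone", "title": "A title"}
    body = json.dumps({"html": "<p>Full review text</p>"})
    print(full_review_url("12345"))
    print(attach_comment(partial, body))
```

Inside the spider, parse_full_passage would then reduce to something like yield attach_comment(response.meta["rev"], response.text).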

Upvotes: 1

Diego Amicabile

Reputation: 589

I would try this:

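 # A scrapy.Request is only fetched once Scrapy schedules it, so it has no
 # response body here; this sketch swaps in a blocking standard-library fetch.
 from urllib.request import urlopen
 with urlopen('https://movie.douban.com/j/review/%s/full' % number) as resp:
     jsonresponse = json.loads(resp.read().decode('utf-8'))
 rev['comment'] = jsonresponse['html']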

You may then want to extract the parts you need from the html field. Alternatively, work with this url.
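
One possible way to pull plain text out of the html field, sketched with the standard library only. The names TextExtractor and html_to_text are made up for illustration, and the sample fragment is invented rather than taken from the real endpoint:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Skip whitespace-only nodes between tags.
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_text(fragment):
    """Return the text of `fragment`, with one space between text nodes."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.chunks)


if __name__ == "__main__":
    print(html_to_text("<p>First</p><p>Second</p>"))
```

With that in place, rev['comment'] could hold html_to_text(jsonresponse['html']) instead of the raw markup, if plain text is what you are after.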

Upvotes: 1
