Tharun Niranjan

Reputation: 21

Python Scrapy - Ajax Pagination Tripadvisor

I'm using Python and Scrapy to scrape the reviews on TripAdvisor member pages. Here is the URL I'm using: http://www.tripadvisor.com/members/scottca075

I'm able to get the first page using Scrapy, but I haven't been able to get the other pages. I watched the XHR requests in the Network tab of the browser while clicking the Next button.

One GET and one POST request are sent. Checking the parameters for the GET request, I see this:

action : undefined_Other_ClickNext_REVIEWS_ALL
gaa : Other_ClickNext_REVIEWS_ALL
gal : 50
gams : 0
gapu : Vq85qQoQKjYAABktcRMAAAAh
gass : members

The request url is

 `http://www.tripadvisor.com/ActionRecord?action=undefined_Other_ClickNext_REVIEWS_ALL&gaa=Other_ClickNext_REVIEWS_ALL&gal=0&gass=members&gapu=Vq8xPAoQLnMAAUutB9gAAAAJ&gams=1`

The parameter gal represents the offset. Each page has 50 reviews. Moving to the second page by clicking the Next button sets gal to 50, then 100, 150, 200, and so on for the following pages.
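The offset behaviour described above can be tried offline with the standard library. This is only a sketch: it parses the request URL quoted earlier, replaces gal, and rebuilds the query string. The helper name with_offset is hypothetical, and whether the server accepts a reused gapu token is an assumption that would need checking.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# The GET request URL observed in the browser (copied from the question).
REQUEST_URL = (
    "http://www.tripadvisor.com/ActionRecord"
    "?action=undefined_Other_ClickNext_REVIEWS_ALL"
    "&gaa=Other_ClickNext_REVIEWS_ALL&gal=0&gass=members"
    "&gapu=Vq8xPAoQLnMAAUutB9gAAAAJ&gams=1"
)


def with_offset(url, offset):
    """Return `url` with its `gal` query parameter set to `offset`.

    `with_offset` is a hypothetical helper, not part of Scrapy.
    """
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    params["gal"] = str(offset)  # query values are always strings
    return urlunsplit(parts._replace(query=urlencode(params)))


# Offsets for the first four pages: 0, 50, 100, 150.
page_urls = [with_offset(REQUEST_URL, offset) for offset in range(0, 200, 50)]

for page_url in page_urls:
    print(page_url)
```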

The data that I want comes back from the POST request in JSON format. The request URL of the POST request is http://www.tripadvisor.com/ModuleAjax?

I'm confused as to how to make the request in scrapy to get the data. I tried using FormRequest as follows:

pagination_url = "http://www.tripadvisor.com/ActionRecord"
formdata = {'action':'undefined_Other_ClickNext_REVIEWS_ALL','gaa':'Other_ClickNext_REVIEWS_ALL', 'gal':'0','gams':'0','gapu':'Vq8EngoQL3EAAJKgcx4AAAAN','gass':'members'}
FormRequest(url=self.pagination_url, formdata=formdata, callback=self.parseItem)

I also tried setting the headers option in the FormRequest:

headers = {'Host':'www.tripadvisor.com','Referer':'http://www.tripadvisor.com/members/prizm','X-Requested-With': 'XMLHttpRequest'}

If someone could explain what I'm missing and point me in the right direction that would be great. I have run out of ideas.

I'm also aware that I can use Selenium, but I want to know if there is a faster way to do this.

Upvotes: 2

Views: 1570

Answers (2)

Awaish Kumar

Reputation: 557

So far what you are doing is correct. Add yield in front of FormRequest, so that the request is actually returned to Scrapy:

yield FormRequest(...)

Secondly, focus on the value of gal, because it is the only parameter that changes here; don't keep gal = "0".

Find the total number of reviews, then start gal at 50 and add 50 with each request until you reach the total. Note that formdata values should be strings:

formdata = {'action':'undefined_Other_ClickNext_REVIEWS_ALL','gaa':'Other_ClickNext_REVIEWS_ALL', 'gal':str(reviews_till_this_page),'gams':'0','gapu':'Vq8EngoQL3EAAJKgcx4AAAAN','gass':'members'}
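As a rough sketch of the loop described above, the per-page form data could be generated like this. The generator name iter_formdata and the total of 180 reviews are made up for illustration; the other values are copied from the question.

```python
PAGE_SIZE = 50  # reviews per page, as stated in the question

# Parameters that stay the same on every request (values from the question).
BASE_FORMDATA = {
    "action": "undefined_Other_ClickNext_REVIEWS_ALL",
    "gaa": "Other_ClickNext_REVIEWS_ALL",
    "gams": "0",
    "gapu": "Vq8EngoQL3EAAJKgcx4AAAAN",
    "gass": "members",
}


def iter_formdata(total_reviews, page_size=PAGE_SIZE):
    """Yield one formdata dict per page after the first.

    `iter_formdata` is a hypothetical helper. The first page (offset 0)
    is assumed to have been fetched already, so offsets start at
    `page_size`.
    """
    for offset in range(page_size, total_reviews, page_size):
        # Copy the shared parameters and set gal as a string.
        yield dict(BASE_FORMDATA, gal=str(offset))


# Example with an assumed total of 180 reviews.
pages = list(iter_formdata(total_reviews=180))
print([page["gal"] for page in pages])
```

Inside a spider callback, each of these dicts would be passed as formdata= to a yielded FormRequest.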

Upvotes: 0

Bhanu prathap

Reputation: 94

Use ScrapyJS - Scrapy+JavaScript integration

To use ScrapyJS in your project, you first need to enable the middleware:

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}
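The middleware also needs to know where Splash is running. A sketch of the extra settings.py lines, assuming a Splash instance listening on localhost port 8050 (that address is an assumption; use whatever host your Splash server runs on):

```python
# settings.py (fragment)

# Address of the running Splash instance.
# localhost:8050 is an assumption, adjust it to your setup.
SPLASH_URL = "http://localhost:8050"

# Duplicate filter that takes Splash arguments into account,
# so two requests for the same URL with different Splash args
# are not treated as duplicates.
DUPEFILTER_CLASS = "scrapyjs.SplashAwareDupeFilter"
```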

For example, if we wanted to retrieve the rendered HTML for a page, we could do something like this:

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"  # every Scrapy spider needs a name; this one is arbitrary
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        # Route each start URL through Splash's render.html endpoint.
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    def parse(self, response):
        # response.body is a result of render.html call; it
        # contains HTML processed by a browser.
        # …
        pass

A common scenario is that the user needs to click a button before the page is displayed. We can handle this using jQuery with Splash:

function main(splash)
    splash:autoload("https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js")
    splash:go("http://example.com")
    splash:runjs("$('#some-button').click()")
    return splash:html()
end
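To run a Lua script like the one above from a spider, it can be sent to Splash's execute endpoint as the lua_source argument. A minimal sketch of building the request meta for that; the helper name splash_execute_meta is hypothetical:

```python
# The Lua script from above, kept as a plain Python string.
LUA_SCRIPT = """
function main(splash)
    splash:autoload("https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js")
    splash:go("http://example.com")
    splash:runjs("$('#some-button').click()")
    return splash:html()
end
"""


def splash_execute_meta(lua_source):
    """Build the `meta` dict that asks Splash to run `lua_source`.

    `splash_execute_meta` is a hypothetical helper; the dict layout
    mirrors the render.html example, with the endpoint switched to
    'execute' and the script passed as the 'lua_source' argument.
    """
    return {
        "splash": {
            "endpoint": "execute",
            "args": {"lua_source": lua_source},
        }
    }


meta = splash_execute_meta(LUA_SCRIPT)
print(meta["splash"]["endpoint"])
```

In start_requests this would be used as scrapy.Request(url, self.parse, meta=meta), and the callback receives whatever the script's main function returns.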

For more details, see the ScrapyJS documentation.

Upvotes: 2
