Alex K.

Reputation: 855

Scraping site that uses AJAX

I've read some relevant posts here but couldn't figure out an answer.

I'm trying to crawl a web page with reviews. When the site is first visited, only 10 reviews are shown, and the user has to press "Show more" (which also appends #add10 to the site's address) every time they scroll to the end of the review list to load 10 more. In fact, a user can get the full review list by appending #add1000 (where 1000 is the number of additional reviews) to the site's address. The problem is that in my spider, site_url#add1000 returns only the first 10 reviews, exactly as site_url does, so this approach doesn't work.

I also can't find a way to make an appropriate Request that imitates the original one sent by the site. The original AJAX URL has the form 'domain/ajaxlst?par1=x&par2=y', and I've tried all of the following:

Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all) 
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all,
        headers={all_headers})
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all,
        headers={all_headers}, cookies={all_cookies})

But every time I get a 404 error. Can anyone explain what I'm doing wrong?

Upvotes: 0

Views: 725

Answers (2)

Hao Lyu

Reputation: 186

Normally, when you scroll down the page, AJAX sends a request to the server, and the server responds with a JSON/XML file that your browser uses to refresh the page.

You need to figure out the URL behind this JSON/XML response. In Firefox, open Tools → Web Developer → Network, monitor the network activity while you scroll, and you can easily catch the request.

Once you find this URL, you can request it directly and parse the reviews from the response (I recommend the Python modules requests and bs4 for this), which saves a huge amount of time. Remember to rotate clients and IPs; be nice to the server and it won't block you.
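A minimal sketch of that approach, assuming the endpoint from the question ('domain/ajaxlst?par1=x&par2=y') returns JSON; the "reviews"/"text" keys and the sample payload below are made up for illustration:

```python
import json

# Fetch the endpoint you spotted in the Network tab. With the requests module:
#   import requests
#   payload = requests.get(
#       "https://domain/ajaxlst",
#       params={"par1": "x", "par2": "y"},
#       headers={"X-Requested-With": "XMLHttpRequest"},  # many AJAX endpoints expect this
#   ).json()

def parse_reviews(payload):
    # "reviews" and "text" are assumed key names; inspect the real response.
    return [item["text"] for item in payload.get("reviews", [])]

# Stand-in for a real response, so the parsing step is demonstrable:
sample = json.loads('{"reviews": [{"text": "Great product"}, {"text": "Too slow"}]}')
print(parse_reviews(sample))  # ['Great product', 'Too slow']
```

If the endpoint returns HTML fragments instead of JSON, feed the response text to bs4's BeautifulSoup and select the review nodes from there.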

Upvotes: 1

Steve

Reputation: 46

What you need here is a headless browser, since the requests module cannot execute the JavaScript behind AJAX calls.

One such browser-automation tool is Selenium.

e.g.:

driver.find_element_by_id("show_more").click()  # just an example; use the real element id
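A fuller sketch of that approach (assumptions: Selenium is installed with a matching Firefox driver on PATH, and the "Show more" link-text locator is hypothetical; inspect the page for the real one):

```python
def load_all_reviews(url, max_clicks=100):
    """Click 'Show more' until it disappears, then return the full page HTML."""
    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.common.exceptions import NoSuchElementException

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        for _ in range(max_clicks):
            try:
                # Hypothetical locator -- match it to the actual button.
                driver.find_element(By.LINK_TEXT, "Show more").click()
            except NoSuchElementException:
                break  # no more reviews to load
            time.sleep(1)  # give the AJAX request time to finish
        return driver.page_source
    finally:
        driver.quit()

# Usage (requires a browser, so not run here):
#   html = load_all_reviews("https://domain/reviews")
```

You can then hand the returned HTML to your usual parser (e.g. bs4 or a Scrapy Selector).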

Upvotes: 1
