Vindhya G

Reputation: 1369

How to crawl a web site where page navigation involves dynamic loading

I want to crawl a website that has multiple pages, and when a page number is clicked the content is loaded dynamically. How do I screen-scrape it?

That is, since the URL is not present in an href attribute, how do I crawl to the other pages?

I would be grateful if someone could help me with this.

PS: The URL remains the same when a different page is clicked.

Upvotes: 3

Views: 5156

Answers (6)

DY.Feng

Reputation: 1

If you don't mind using gevent, GRobot is another good choice.

Upvotes: 0

furins

Reputation: 5048

You should also consider Ghost.py, since it allows you to run arbitrary JavaScript commands, fill forms and take snapshots very quickly.

Upvotes: 2

shanks

Reputation: 2017

If you are using Google Chrome, you can check the URL that is being called dynamically under Network -> Headers in the developer tools.

Based on that, you can identify whether it is a GET or a POST request.

If it is a GET request, you can find the parameters straight away in the URL.

If it is a POST request, you can find the parameters under Form Data in Network -> Headers of the developer tools.
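Once you have spotted the request in the Network tab, you can replay it from Python. A minimal sketch, assuming a hypothetical endpoint and parameter names (`page`, `size`) — substitute whatever you actually see in the dev tools:

```python
# Replay a paginated AJAX GET request discovered in Chrome's
# Network -> Headers panel. The endpoint and parameter names here
# are made up for illustration.
from urllib.parse import urlencode
from urllib.request import Request, urlopen

AJAX_URL = "http://example.com/items"  # hypothetical endpoint

def build_page_request(page):
    """Build the GET request for one page of results."""
    query = urlencode({"page": page, "size": 20})
    return Request(f"{AJAX_URL}?{query}")

# To actually fetch a page (requires network access):
# html = urlopen(build_page_request(1)).read().decode("utf-8")
```

Looping `page` from 1 upward until the response comes back empty gives you every page, even though the visible URL in the browser never changes.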

Upvotes: 1

Thai Tran

Reputation: 9935

You cannot do that easily, since it is AJAX pagination (even with mechanize). Instead, open the source of the page and try to find out which URL the AJAX pagination requests. Then you can craft a fake request back to that URL and process the returned data in your own way.
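A minimal sketch of "crafting a fake request back", assuming the pagination turns out to be a POST with a form field named `pageIndex` (both the endpoint and the field name are invented here):

```python
# Reproduce the POST the browser's AJAX pagination makes. Many sites
# check for the X-Requested-With header on AJAX calls, so set it too.
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def build_ajax_post(page):
    data = urlencode({"pageIndex": page}).encode("ascii")
    return Request(
        "http://example.com/ajax/list",  # hypothetical endpoint
        data=data,                       # presence of data makes this a POST
        headers={"X-Requested-With": "XMLHttpRequest"},
    )

# To actually send it (requires network access):
# body = urlopen(build_ajax_post(2)).read()
```

The returned body is often JSON or an HTML fragment; parse it with `json.loads` or an HTML parser as appropriate.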

Upvotes: 0

Nirvana Tikku

Reputation: 4205

Since this post has been tagged with python and web-crawler, Beautiful Soup has to be mentioned: http://www.crummy.com/software/BeautifulSoup/

Documentation here: http://www.crummy.com/software/BeautifulSoup/bs3/download/2.x/documentation.html
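Beautiful Soup does not fetch pages itself, but it parses whatever HTML you get back, including the fragments an AJAX endpoint returns. A quick sketch using the bs4 package with a made-up HTML fragment:

```python
# Parse an HTML fragment (e.g. the body of an AJAX pagination response)
# and pull out the item links. The fragment below is invented.
from bs4 import BeautifulSoup

html = """
<ul id="results">
  <li><a href="/item/1">First</a></li>
  <li><a href="/item/2">Second</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
links = [a["href"] for a in soup.select("#results a")]
```

Combine this with one of the request-replaying approaches from the other answers to walk every page.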

Upvotes: 0

agoebel

Reputation: 401

You could look for the data you want in the JavaScript code instead of the HTML. This is usually a pain, but you can do fun things with regular expressions.

Alternatively, some of the browser-testing libraries like splinter work by loading the page in an actual browser such as Firefox or Chrome before scraping. One of those would work if you are running this on a machine with a browser installed.
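The regular-expression route looks roughly like this. The `page_source` below is an invented example of the kind of inline script block that dynamically paginated pages often embed:

```python
# Pull structured data out of inline JavaScript with a regex, then
# decode it as JSON. The script block is made up for illustration.
import json
import re

page_source = """
<script>
var items = [{"id": 1, "name": "foo"}, {"id": 2, "name": "bar"}];
</script>
"""

match = re.search(r"var items = (\[.*?\]);", page_source, re.DOTALL)
items = json.loads(match.group(1))
names = [item["name"] for item in items]
```

This only works when the embedded data happens to be valid JSON; otherwise you will need looser patterns, which is where the pain comes in.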

Upvotes: 0
