Reputation: 1369
I want to crawl a website that has multiple pages; when a page number is clicked, the content is loaded dynamically. How do I screen scrape it?
i.e. since the URL is not present in an href or an <a> tag, how do I crawl to the other pages?
I would be grateful if someone could help me with this.
PS: The URL remains the same when a different page is clicked.
Upvotes: 3
Views: 5156
Reputation: 5048
You should also consider Ghost.py, since it allows you to run arbitrary JavaScript commands, fill forms and take snapshots very quickly.
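A rough sketch of how that might look (method names vary a bit between Ghost.py versions, and the URL and selector below are placeholders):

    # Placeholders: replace the URL and the "page 2" selector with the real ones.
    from ghost import Ghost

    ghost = Ghost()
    page, resources = ghost.open("http://example.com/listing")  # load the page

    # Click the pagination link that a plain crawler cannot follow,
    # then wait for the AJAX content to arrive.
    ghost.click("a.page-2")            # placeholder selector for the "page 2" link
    ghost.wait_for_page_loaded()       # newer versions also offer wait_for_selector()

    html = ghost.content               # HTML after the JavaScript has run
    ghost.capture_to("page2.png")      # optional snapshot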
Upvotes: 2
Reputation: 2017
If you are using Google Chrome, you can check the URL that is being called dynamically under network -> headers in the developer tools. Based on that, you can identify whether it is a GET or a POST request.

If it is a GET request, you can find the parameters straight away from the URL. If it is a POST request, you can find the parameters under form data in network -> headers of the developer tools.
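Once you know the method and the parameters, you can reproduce the request from Python, for example with the requests library. The endpoint and parameter names below are made up; copy the real ones from the Network tab:

    # Illustrative only: take the actual request URL and form data from the
    # developer tools of your own page.
    import requests

    # If the pagination call was a GET request:
    resp = requests.get("http://example.com/items", params={"page": 2})

    # If it was a POST request, send the form data you saw in the Network tab:
    resp = requests.post("http://example.com/items", data={"page": 2})

    print(resp.status_code)
    print(resp.text[:500])   # raw HTML or JSON returned for page 2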
Upvotes: 1
Reputation: 9935
You cannot do that easily, since it is AJAX pagination (even with mechanize). Instead, open the page source and work out which URL is requested for the AJAX pagination. Then you can send that same request yourself and process the returned data your own way.
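A hedged sketch of that approach; the AJAX endpoint and the "page" parameter are invented, so substitute whatever request the page actually makes:

    # Replay the pagination request for several pages and handle either JSON or HTML.
    import requests

    for page in range(1, 6):
        resp = requests.get("http://example.com/ajax/list", params={"page": page})
        if "json" in resp.headers.get("Content-Type", ""):
            data = resp.json()      # many AJAX endpoints return JSON
        else:
            data = resp.text        # otherwise you get an HTML fragment
        # ...process the returned data your own way...
        print(page, str(data)[:100])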
Upvotes: 0
Reputation: 4205
Since this post has been tagged with python and web-crawler, Beautiful Soup has to be mentioned: http://www.crummy.com/software/BeautifulSoup/
Documentation here: http://www.crummy.com/software/BeautifulSoup/bs3/download/2.x/documentation.html
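Note that Beautiful Soup only parses the HTML you hand it; it does not execute JavaScript, so combine it with one of the request-replay or browser approaches above. A minimal sketch with the current bs4 package (the URL and the "item" selector are placeholders):

    # Fetch one page's HTML and pull out the rows; adjust the selector to the real markup.
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("http://example.com/items?page=2").text
    soup = BeautifulSoup(html, "html.parser")

    for row in soup.find_all("div", class_="item"):   # placeholder selector
        print(row.get_text(strip=True))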
Upvotes: 0
Reputation: 401
You could look for the data you want in the JavaScript code instead of the HTML. This is usually a pain, but you can do fun things with regular expressions.
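For example, if the page happened to embed its data in a JavaScript literal (the variable name and pattern below are purely hypothetical):

    # Hypothetical: assumes something like "var items = [...];" appears in the page
    # and that the literal is valid JSON -- adjust the pattern to the real source.
    import json
    import re

    import requests

    html = requests.get("http://example.com/listing").text
    match = re.search(r"var\s+items\s*=\s*(\[.*?\]);", html, re.DOTALL)
    if match:
        items = json.loads(match.group(1))
        print(items)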
Alternatively, some browser-testing libraries such as splinter work by loading the page in an actual browser like Firefox or Chrome before scraping. One of those would work if you are running this on a machine with a browser installed.
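A minimal splinter sketch, assuming Firefox and its driver are installed and that the pagination links are labelled by page number (both assumptions, adjust as needed):

    # Drive a real browser, click "page 2", then grab the HTML after the AJAX load.
    from splinter import Browser

    with Browser("firefox") as browser:
        browser.visit("http://example.com/listing")
        browser.find_by_text("2").first.click()   # click the "page 2" link
        html = browser.html                       # HTML after the JavaScript has run
        print(html[:500])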
Upvotes: 0