Reputation:
I'm trying to scrape a YouTube page with Python, and the page relies heavily on AJAX.
I have to call the JavaScript each time to get the info, but I'm not really sure how to go about it. I'm using the urllib2 module to open URLs. Any help would be appreciated.
Upvotes: 2
Views: 4397
Reputation: 2952
You could sniff the network traffic with something like Wireshark, then replay the HTTP calls via a scraping framework that is robust enough to deal with AJAX, such as Scrapy.
Upvotes: 0
Reputation: 18387
Here is how I would do it: install Firebug on Firefox, turn on the Net panel in Firebug, and click the desired link on YouTube. Watch which pages are requested and find the ones responsible for the AJAX part of the page. Now you can use urllib or Mechanize to fetch that URL directly.
If you CAN pull the same content this way, you have what you are looking for, and all that remains is to parse the content. If you CAN'T, the requested page is probably checking user login credentials, session info, or other header fields such as HTTP_REFERER. In that case you might want to look at something more extensive, like Scrapy. I would suggest that you always follow the simple path first. Good luck and happy "responsible" scraping! :)
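To make the "fetch the AJAX URL directly" step concrete, here is a minimal sketch. It uses Python 3's urllib.request (the successor to urllib2). The function names, the User-Agent string, and the example.com URLs are placeholders of mine, not anything YouTube-specific; substitute whatever URL and headers you actually observe in the Net panel.

```python
import urllib.request


def build_ajax_request(url, referer=None, cookie=None):
    """Build (but do not send) a request that mimics a browser's AJAX call.

    All header values are illustrative assumptions; copy the real ones
    from the request you observed in the browser's network panel.
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; example-scraper)",
        # Many sites use this header to recognise XMLHttpRequest calls.
        "X-Requested-With": "XMLHttpRequest",
    }
    if referer:
        headers["Referer"] = referer
    if cookie:
        headers["Cookie"] = cookie
    return urllib.request.Request(url, headers=headers)


def fetch(request, timeout=10):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


if __name__ == "__main__":
    # Placeholder URL: replace with the AJAX endpoint you found.
    req = build_ajax_request(
        "https://example.com/ajax?page=2",
        referer="https://example.com/",
    )
    print(fetch(req)[:200])
```

If the plain request returns the content, you are done; if not, try adding the Referer and Cookie values from the browser before reaching for a heavier tool.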
Upvotes: 2
Reputation: 123518
As suggested, you should use the YouTube API to access the data made available legitimately.
Regarding the general question of scraping AJAX, you might want to consider the Scrapy framework. It provides extensive support for crawling and scraping web sites and uses python-spidermonkey under the hood to access JavaScript links.
Upvotes: 0
Reputation: 881695
The main problem is that you're violating the TOS (terms of service) for the YouTube site. YouTube engineers and lawyers will do their professional best to track you down and make an example of you if you persist. If you're happy with that prospect, then on your head be it. Technically, your best bets are python-spidermonkey and Selenium. I wanted to put the technical hints on record in case anybody in the future has needs like the ones your question's title indicates, without the legal issues you clearly have if you continue in this particular endeavor.
Upvotes: 6
Reputation: 40052
YouTube (like everything else Google makes) already has EXTENSIVE APIs in place that give you access to just about any data you could possibly want.
Take a look at the YouTube Data API for more information.
I use urllib to make the API requests and ElementTree to parse the returned XML.
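As a rough illustration of the ElementTree half of that approach, the sketch below parses an Atom-style XML feed into (title, link) pairs. The sample feed, its URLs, and the function name are made up for the example; the real API response will have its own structure and namespaces, so check the API documentation for the actual element names. In practice the XML text would come from a urllib call rather than a string literal.

```python
import xml.etree.ElementTree as ET

# Atom namespace, written in ElementTree's "{uri}tag" form.
ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical sample response, standing in for what urllib would return.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>First video</title>
    <link rel="alternate" href="https://example.com/watch?v=abc"/>
  </entry>
  <entry>
    <title>Second video</title>
    <link rel="alternate" href="https://example.com/watch?v=def"/>
  </entry>
</feed>"""


def parse_entries(xml_text):
    """Return a list of (title, href) tuples, one per <entry> in the feed."""
    root = ET.fromstring(xml_text)
    results = []
    for entry in root.findall(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="")
        link = entry.find(ATOM + "link")
        href = link.get("href") if link is not None else None
        results.append((title, href))
    return results


if __name__ == "__main__":
    for title, href in parse_entries(SAMPLE_FEED):
        print(title, href)
```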
Upvotes: 6