Python crawler. Parsing and executing ajax

Question

I have a basic structure set up for a crawler. Now I released it on some php driven websites and it works like a charm. Though now I want to it to build datasheets from ajax content.

At the moment I am using Mechanize for PYTHON and perl to build my crawler. Though Mechanize module does not execute AJAX. How do i get to the content that is build by asynchronomous ajax?

I know there is something called Selenium, a real browser to automate. But is this my only option?

RanRag · Accepted Answer

Your can run a headless browser e.g phantomjs which understands JavaScript, DOM etc but you will have to write your code in Javascript, benefit is that you can do whatever you want.

There is another way but its messy.

You could observe what requests are made when you click the button (using Firebug in Firefox or Developer Tools in Chrome). Than try to reverse engineer the javascript running behind the page and try to do the similar thing using your python code, for that take a look at Spidermonkey

Python crawler. Parsing and executing ajax

Answers (1)

Related Questions