codeScriber

Reputation: 4612

Scraping a web page with JavaScript in Python

I'm working in Python 3.2 (I'm new to it) on a Windows machine. I also have Ubuntu 10.04 in VirtualBox if needed, but I'd prefer to work on Windows.

Basically, I'm able to scrape web pages with the http and urllib modules, but only pages that don't use JavaScript such as document.write("<div....") to add data that isn't there when I fetch the actual page (meaning without real AJAX scripts).

To process those kinds of sites as well, I'm pretty sure I need a browser-style JavaScript processor that runs the page's scripts and gives me the final result, hopefully as a dict or text.

I tried to compile python-spidermonkey, but as I understand it, it doesn't support Windows and doesn't work with Python 3.x :-?

Any suggestions? If anyone has done something like this before, I'd appreciate the help!

Upvotes: 0

Views: 10235

Answers (3)

hoju

Reputation: 29472

I recommend Python's bindings to the WebKit library - here is an example. WebKit is cross-platform and is used to render web pages in Chrome and Safari. An excellent library.

Upvotes: 2

Henley Wing Chiu

Reputation: 22535

Use Firebug to see exactly what is being requested to get the data to display (a POST or GET URL?). I suspect there's an AJAX call that's retrieving the data from the server as either XML or JSON. Just make the same request yourself and parse the data.
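To make that concrete, here is a minimal sketch using only the standard library. The endpoint, the query parameters, and the JSON shape (an "items" list whose entries have a "title" key) are all made up for illustration; substitute whatever Firebug actually shows for your page. The X-Requested-With header is just a convention some servers check, not something every site requires.

```python
import json
import urllib.parse
import urllib.request


def build_url(endpoint, params=None):
    """Append query parameters (if any) to the endpoint URL."""
    if not params:
        return endpoint
    return endpoint + "?" + urllib.parse.urlencode(params)


def fetch_json(endpoint, params=None):
    """GET the endpoint the page's script calls and decode the JSON body."""
    request = urllib.request.Request(
        build_url(endpoint, params),
        # Assumption: some servers only answer "AJAX-looking" requests.
        headers={"X-Requested-With": "XMLHttpRequest"},
    )
    with urllib.request.urlopen(request) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return json.loads(response.read().decode(charset))


def extract_titles(payload):
    """Pull the titles out of the (hypothetical) response structure."""
    return [item["title"] for item in payload.get("items", [])]
```

You would then call something like extract_titles(fetch_json("http://example.com/data", {"page": 2})), with the real URL and keys in place of these placeholders.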

Optionally, you can download Selenium for Firefox, start a Selenium server, download the page via Selenium, and get the DOM contents. MozRepl works as well, but doesn't have as much documentation since it's not widely used.
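A rough sketch of the Selenium route, in case it helps. It assumes the third-party selenium package and a locally installed Firefox, and it uses the WebDriver API, which drives the browser directly instead of going through a separately started Selenium server. The names fetch_rendered_html and visible_text are my own helpers, not part of Selenium.

```python
from html.parser import HTMLParser


def fetch_rendered_html(url):
    """Load the page in a real Firefox and return the DOM after scripts ran."""
    # Imported here so the parsing helper below works without Selenium.
    from selenium import webdriver  # third-party: pip install selenium

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


class TextCollector(HTMLParser):
    """Collect non-empty text nodes, skipping script and style contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html):
    """Return the text nodes of an HTML string as a list."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Typical use would be visible_text(fetch_rendered_html(url)). Content that loads late may need an explicit wait before reading page_source.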

Upvotes: 1

Lennart Regebro

Reputation: 172359

document.write is usually used because you are generating the content on the fly, often by fetching data from a server. What you get are web apps that are more about JavaScript than HTML. "Scraping" is rather more a question of downloading HTML and processing it, but here there isn't any HTML to download. You are essentially trying to scrape a GUI program.

Most of these applications have some sort of API, often returning XML or JSON data, that you can use instead. If they don't, you should probably try to remote control a real web browser instead.
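If the API turns out to return XML, the standard library can parse it without any browser involved. This is only a sketch: the element and attribute names (item, id, name) are invented for the example and will differ for a real service.

```python
import xml.etree.ElementTree as ET


def parse_items(xml_text):
    """Turn a (hypothetical) <items> document into a list of dicts."""
    root = ET.fromstring(xml_text)
    return [
        {"id": item.get("id"), "name": item.findtext("name", default="")}
        for item in root.iter("item")
    ]
```

For a JSON API the same idea applies with json.loads in place of ElementTree.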

Upvotes: 0
