Freesnöw

Reputation: 32143

Screen scrape a site with Python (server side)

I'm creating a site that represents a virtual company's website (in this case, a virtual bakery). I've already set up the products and the cart system; the problem now is getting it to work with the virtual bank system. Everybody involved in the system has an account, and so do I. I'm pretty new to Python, and so far I've mostly been taking existing scripts and editing them slightly to suit my needs.

My Question...

Sorry, I would include a link to the site I'm trying to access, but it seems to be down at the moment. How convenient.

Upvotes: 0

Views: 906

Answers (2)

ravenac95

Reputation: 3637

Unfortunately, there isn't a very good way to traverse a JavaScript-dependent site from within Python (or anything else outside of a browser). Even if you were to use Mechanize with python-spidermonkey, or some other JavaScript bridge for Python (perhaps pyV8), those bridges alone don't emulate the DOM. Therefore any JavaScript that deals with UI interaction just won't function.

However, if the site that you wish to log in to does not depend on JavaScript, then traversing the website is entirely possible. My suggestion would be to use Kenneth Reitz's requests module. You could do something like the following:

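import requests

# To handle logins you'll most likely need to maintain a session
# if the site you log in to usually expects a human
s = requests.Session()  # starts a session that keeps cookies between requests

# Next you want to log in to the site
# (the URL and the form field names "u" and "p" are placeholders)
login_response = s.post("http://somesite.com/login", data={"u": "username", "p": "password"})
login_response.raise_for_status()  # stop early on an HTTP error status

# Now you're logged in and you can do anything you want
# using the session instance
response_data = s.get("http://somesite.com/awesome-page-id-like-to-grab")
response_data.raise_for_status()

# Do something with the response data ...
# my_response_parsing_function is a placeholder for a function you write yourself;
# .content is the raw bytes of the page, .text is the decoded string
my_response_parsing_function(response_data.content)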

There are other ways to do it that use only Python's standard library, but requests handles all that nitty-gritty stuff for you.
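For comparison, here is a rough sketch of what the standard-library route might look like, assuming Python 3 and a site that tracks the login with cookies. The helper names (make_session, encode_form, login_and_fetch) are made up for this illustration, and the URLs and form field names are the same placeholders as above.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def encode_form(fields):
    """URL-encode a dict of form fields into a POST body (bytes)."""
    return urllib.parse.urlencode(fields).encode("utf-8")


def login_and_fetch(login_url, page_url, fields):
    """Log in with a form POST, then fetch a page using the same cookies."""
    opener, _jar = make_session()
    # Passing data= makes urllib send a POST instead of a GET
    with opener.open(login_url, data=encode_form(fields)) as resp:
        resp.read()
    # Cookies set by the login response are sent along with this request
    with opener.open(page_url) as resp:
        return resp.read()


# Hypothetical usage (placeholder URLs and field names):
# html = login_and_fetch(
#     "http://somesite.com/login",
#     "http://somesite.com/awesome-page-id-like-to-grab",
#     {"u": "username", "p": "password"},
# )
```

This is more code than the requests version for the same result, which is the nitty-gritty that requests hides.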

Upvotes: 0

sgallen

Reputation: 2109

I'd suggest checking out mechanize for logging in: http://wwwsearch.sourceforge.net/mechanize/

For clicking buttons check out this answer: https://stackoverflow.com/a/1806266/1104941
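On a page that doesn't rely on JavaScript, "clicking" a button usually amounts to submitting the form it belongs to. As a rough sketch of that idea using only the standard library (not mechanize), the snippet below collects the action URL and input fields of the first form on a page so they can be posted back. The class and function names, the "/transfer" action and the field names are all invented for illustration.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect the action URL and named <input> fields of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action")
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Inputs without a value attribute are submitted as empty strings here
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True  # ignore any later forms on the page


def extract_form(html):
    """Return (action, fields) for the first form found in an HTML string."""
    parser = FormFieldParser()
    parser.feed(html)
    return parser.action, parser.fields
```

The returned fields dict (with your own values filled in) could then be sent as the data of a POST to the action URL, using whichever session you logged in with.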

Edit:

Additional useful links:

Upvotes: 4
