Reputation: 5462
I'm trying to automate downloading images from a website that requires a login. The login form is on every page (in the browser you click "login" and a JavaScript slide-down reveals the form). I log in using the code below, but when I get to agent.get("http://cdn.com/some_image.jpg"), a 403 error is raised. This doesn't happen when I log in through the browser and visit "http://cdn.com/some_image.jpg", so what is going on, and how can I get around it?
require "mechanize"

path = "http://www.example.com/some_path"
agent = Mechanize.new

# Fetch a page that contains the login form, fill it in, and submit it.
agent.get(path) do |login_page|
  form = login_page.form_with(action: "http://www.example.com/authorize")
  form.field_with(name: "username").value = "some_user"
  form.field_with(name: "password").value = "password"
  form.submit
end

# Download the image only if it has not been saved already.
# This is the request that fails with a 403.
unless File.exist?("some_image.jpg")
  agent.get("http://cdn.com/some_image.jpg").save("some_image.jpg")
end
Upvotes: 0
Views: 661
Reputation: 55002
Since the image is served from a CDN, I would guess they're checking the User-Agent or Referer header.
Mechanize should be setting the referer properly, so that leaves the user agent.
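To make the two headers mentioned above concrete, here is a minimal sketch that builds (but does not send) a request using only Ruby's standard-library Net::HTTP. The helper name build_image_request and the browser-like User-Agent string are illustrative assumptions, not something from the question, and whether the CDN actually checks either header is a guess.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: builds a GET request carrying browser-like headers.
# Nothing is sent over the network here.
def build_image_request(url, referer:, user_agent:)
  request = Net::HTTP::Get.new(URI(url))
  request["User-Agent"] = user_agent # replaces Ruby's default "Ruby" agent string
  request["Referer"] = referer       # the page the image would be linked from
  request
end

request = build_image_request(
  "http://cdn.com/some_image.jpg",
  referer: "http://www.example.com/some_path",
  # Assumed example value; any realistic browser string would do.
  user_agent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 " \
              "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

puts request["User-Agent"]
puts request["Referer"]
```

In Mechanize itself, the corresponding settings are agent.user_agent_alias = "Mac Safari" (or agent.user_agent = "...") and the optional third referer argument to agent.get, so it may be worth trying those before anything else.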
Upvotes: 1
Reputation: 16304
Think about this: you submitted a login request, and then a request for the image. How does the server know that you are the person who logged in with the first request? Tracking by IP (could be shared or a proxy), port (wouldn't typically survive multiple requests), user agent (not unique), etc. obviously wouldn't work. Typically, login sessions are implemented using cookies: a web client is given a session token in the form of a cookie, which, when presented back to the server in a subsequent request, tells the server which session the request belongs to. This lets the server track logins across what are otherwise stateless web requests.
There are other methods, but they mostly revolve around passing this token in another way (a custom header, GET URL parameters, etc.), with the notable exception of signed web requests such as those AWS uses (cool, but not very common for web logins). All in all, session cookies are by far the most common implementation.
Thus, I suggest you take a look at this post, as there seems to be a way to manage cookies within the Mechanize gem for use with subsequent requests.
Maintaining cookies between Mechanize requests
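The session-cookie mechanism described above can be sketched as a toy, in-memory simulation. Everything here (the ToyServer class, its methods, the cookie name "session") is hypothetical and only illustrates the idea; it is not how the site in the question or Mechanize is implemented.

```ruby
require "securerandom"

# Toy stand-in for a site that tracks logins with a session cookie.
class ToyServer
  def initialize
    @sessions = {} # session token => username
  end

  # Returns [status, headers]; a successful login sets a session cookie.
  def login(username, password)
    return [401, {}] unless password == "password"

    token = SecureRandom.hex(16)
    @sessions[token] = username
    [200, { "Set-Cookie" => "session=#{token}" }]
  end

  # Serves the image only when the Cookie header carries a known token.
  def image(headers = {})
    token = headers["Cookie"].to_s[/session=(\h+)/, 1]
    @sessions.key?(token) ? 200 : 403
  end
end

server = ToyServer.new
_status, response_headers = server.login("some_user", "password")
cookie = response_headers["Set-Cookie"]

without_cookie = server.image                  # no token presented
with_cookie = server.image("Cookie" => cookie) # token presented back

puts "without cookie: #{without_cookie}"
puts "with cookie: #{with_cookie}"
```

The request that omits the cookie is rejected even though a login happened a moment earlier, which is the same symptom as the 403 in the question if the session token never reaches the server that hosts the image.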
Upvotes: 1