Rontron

Reputation: 4233

Logging in to web scrape

I'm trying to scrape a page on www.roblox.com that requires me to be logged in. I have done this using the .ROBLOSECURITY cookie; however, that cookie changes every few days. Instead, I want to log in through the login form using Python. The form and my code so far are below. I do NOT want to use any add-on libraries such as mechanize or requests.

Form:

<form action="/newlogin" id="loginForm" method="post" novalidate="novalidate" _lpchecked="1">
    <div id="loginarea" class="divider-bottom" data-is-captcha-on="False">
        <div id="leftArea">
            <div id="loginPanel">
                <table id="logintable">
                    <tbody>
                        <tr id="username">
                            <td><label class="form-label" for="Username">Username:</label></td>
                            <td><input class="text-box text-box-medium valid" data-val="true" data-val-required="The Username field is required." id="Username" name="Username" type="text" value="" autocomplete="off" aria-required="true" aria-invalid="false" style="cursor: auto; background-image: url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR4nGP6zwAAAgcBApocMXEAAAAASUVORK5CYII=);"></td>
                        </tr>
                        <tr id="password">
                            <td><label class="form-label" for="Password">Password:</label></td>
                            <td><input class="text-box text-box-medium" data-val="true" data-val-required="The Password field is required." id="Password" name="Password" type="password" autocomplete="off" style="cursor: auto; background-image: url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR4nGP6zwAAAgcBApocMXEAAAAASUVORK5CYII=);"></td>
                        </tr>
                    </tbody>
                </table>
                <div>
                </div>
                <div>
                    <div id="forgotPasswordPanel">
                        <a class="text-link" href="/Login/ResetPasswordRequest.aspx" target="_blank">Forgot your password?</a>
                    </div>
                    <div id="signInButtonPanel" data-use-apiproxy-signin="False" data-sign-on-api-path="https://api.roblox.com/login/v1">
                        <a roblox-js-onclick="" class="btn-medium btn-neutral">Sign In</a>
                        <a roblox-js-oncancel="" class="btn-medium btn-negative">Cancel</a>
                    </div>
                    <div class="clearFloats">
                    </div>
                </div>
                <span id="fb-root">
                    <div id="SplashPageConnect" class="fbSplashPageConnect">
                        <a class="facebook-login" href="/Facebook/SignIn?returnTo=/home" ref="form-facebook">
                            <span class="left"></span>
                            <span class="middle">Login with Facebook<span>Login with Facebook</span></span>
                            <span class="right"></span>
                        </a>
                    </div>
                </span>
            </div>
        </div>
        <div id="rightArea" class="divider-left">
            <div id="signUpPanel" class="FrontPageLoginBox">
                <p class="text">Not a member?</p>
                <h2>Sign Up to Build &amp; Make Friends</h2>
                <a roblox-js-onsignup="" class="btn-medium btn-primary">Sign Up</a>
            </div>
        </div>
    </div>
    <input id="ReturnUrl" name="ReturnUrl" type="hidden" value="">
</form>

What I have so far:

import cookielib
import urllib
import urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

opener.addheaders = [('User-agent', 'Mozilla/5.0')]

urllib2.install_opener(opener)

authentication_url = 'http://www.roblox.com/newlogin'

payload = {
    'ReturnUrl' : 'http://www.roblox.com/home',
    'Username' : 'usernamehere',
    'Password' : 'passwordhere'
    }

data = urllib.urlencode(payload)

req = urllib2.Request(authentication_url, data)

resp = urllib2.urlopen(req)
contents = resp.read()
print contents

What is wrong with my code? When I print contents, I only get the login page.

PS: The login page is HTTPS

Upvotes: 1

Views: 562

Answers (2)

Kyrubas

Reputation: 897

I made this class a few weeks ago using just urllib.request for some web scraping and automatic tab opening. It may help you, or at least point you in the right direction.

import urllib.request


class Log_in:
    def __init__(self, loginURL, username, password):
        self.loginURL = loginURL
        self.username = username
        self.password = password

    def log_in_to_site(self):
        auth_handler = urllib.request.HTTPBasicAuthHandler()
        auth_handler.add_password(realm=None,
                                  uri=self.loginURL,
                                  user=self.username,
                                  passwd=self.password)
        opener = urllib.request.build_opener(auth_handler)
        urllib.request.install_opener(opener)
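
One caveat worth spelling out: HTTPBasicAuthHandler answers an HTTP Basic authentication challenge (a 401 response); it does not submit an HTML login form like the one in the question. The sketch below only contrasts what each mechanism would send and makes no network request. The credentials are the placeholders from the question, and the field names are taken from the form's `name` attributes.

```python
import base64
import urllib.parse

# Placeholder credentials, as in the question.
username = "usernamehere"
password = "passwordhere"

# HTTP Basic auth: the credentials travel in an Authorization header
# as base64("user:password").
token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
basic_header = "Basic " + token

# Form login: the credentials travel in a URL-encoded POST body,
# keyed by the form's input names.
form_body = urllib.parse.urlencode(
    {"Username": username, "Password": password}
).encode("ascii")

print(basic_header)
print(form_body)
```

If the site only accepts the form POST, the Basic auth handler will likely have no effect, because the server never issues a 401 challenge for it to answer.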

Upvotes: 1

Cœur

Reputation: 38667

Solution from OP.

I finished the script myself with the code below:

import cookielib
import urllib
import urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

opener.addheaders = [('User-agent', 'Mozilla/5.0')]

urllib2.install_opener(opener)

authentication_url = 'https://www.roblox.com/newlogin'

payload = {
    'username' : 'YourUsernameHere',
    'password' : 'YourPasswordHere',
    '' : 'Log In',
    }

data = urllib.urlencode(payload)

req = urllib2.Request(authentication_url, data)

resp = urllib2.urlopen(req)
PageYouWantToOpen = urllib2.urlopen("http://www.roblox.com/develop").read()
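
The script above targets Python 2 (cookielib, urllib2). For Python 3, a rough equivalent using only the standard library might look like the sketch below. The helper name `build_login_request` is hypothetical, the payload mirrors the one above, and whether the endpoint still accepts this form is an assumption the thread does not confirm. The sketch only builds the opener and the request; nothing is sent until `opener.open(...)` is called.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_login_request(login_url, username, password):
    """Build a cookie-aware opener and a form-login POST request (hypothetical helper)."""
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )
    opener.addheaders = [("User-agent", "Mozilla/5.0")]

    # Same fields as the Python 2 script above.
    payload = {
        "username": username,
        "password": password,
        "": "Log In",
    }
    # In Python 3 the POST body must be bytes, so encode after urlencode.
    data = urllib.parse.urlencode(payload).encode("utf-8")

    # Supplying data makes urllib issue a POST instead of a GET.
    request = urllib.request.Request(login_url, data=data)
    return opener, request


opener, request = build_login_request(
    "https://www.roblox.com/newlogin", "YourUsernameHere", "YourPasswordHere"
)
# To actually log in and fetch a page (network access required):
#   opener.open(request)
#   page = opener.open("http://www.roblox.com/develop").read()
```

Using the same opener for the login and for later requests is what carries the session cookies over, which is the role `install_opener` plays in the Python 2 version.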

Upvotes: 1
