2705114-john

Reputation: 762

Log in to a webpage with Python to scrape data

I am trying to build a web scraper to extract my stats data from MWO Mercs. To do so, I need to log in to the page and then go through the 6 different stats pages to get the data (this will go into a database later, but that is not my question).

The login form is given below (from https://mwomercs.com/login?return=/profile/stats?type=mech). From what I can see, two fields, EMAIL and PASSWORD, need to be filled in and posted. It should then open http://mwomercs.com/profile/stats?type=mech. After that I need to keep a session open to cycle through the various stats pages.

I have tried using urllib, mechanize and requests but I have been totally unable to find the right answer - I would prefer to use requests.

I do realise that similar questions have been asked on Stack Overflow, but I have searched for a very long time with no success.

Thank you for any help that can be provided.

<div id="stubPage">
    <div class="container">
        <h1 id="stubPageTitle">LOGIN</h1>
        <div id="loginForm">
            <form action="/do/login" method="post">

                <legend>MechWarrior Online <a href="/signup" class="btn btn-warning pull-right">REGISTER</a></legend>


                <label>Email Address:</label>
                <div class="input-prepend"><span class="add-on textColorBlack textPlain">@</span><input id="email" name="email" class="span4" size="16" type="text" placeholder="[email protected]"></div>

                <label>Password:</label>

                <div class="input-prepend"><span class="add-on"><span class="icon-lock"></span></span><input id="password" name="password" class="span4" size="16" type="password"></div>

                <br>
                <button type="submit" class="btn btn-large btn-block btn-primary">LOGIN</button>

                <br>
                <span class="pull-right">[ <a href="#" id="forgotLink">Forgot Your Password?</a> ]</span>

                <br>
                <input type="hidden" name="return" value="/profile/stats?type=mech">
            </form>
        </div>
    </div>
</div>

Upvotes: 1

Views: 976

Answers (1)

Amarok

Reputation: 910

The Requests documentation is simple and easy to follow when it comes to submitting form data. Please read through the "More complicated POST requests" section of its quickstart.

Logins usually come down to saving the cookie and sending it with future requests.

After you POST to the login page with requests.post(), use the response object it returns to retrieve the cookies. This is one way to do it:

import requests

# Keys must match the form's input names ('email' and 'password' in the form above).
post_headers = {'content-type': 'application/x-www-form-urlencoded'}
payload = {'email': email, 'password': password}
login_request = requests.post(login_url, data=payload, headers=post_headers)
cookie_dict = login_request.cookies.get_dict()
stats_request = requests.get(stats_url, cookies=cookie_dict)

If you still have problems, check the return code with login_request.status_code, or look for an error message in the page content with login_request.text.
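As a rough sketch of that check, the helper below summarises a login response. It assumes, going by the form HTML in the question, that a failed login re-renders a page containing id="loginForm"; that marker and the function name are guesses for illustration, not something the site documents.

```python
def summarize_login(response, form_marker='id="loginForm"'):
    """Return a small dict describing how a login POST went.

    `response` is a requests.Response (or anything exposing the same
    status_code, url and text attributes). `form_marker` is an assumed
    hint: if the login form is still in the page, the login likely failed.
    """
    return {
        "status_code": response.status_code,
        "final_url": response.url,
        "login_form_present": form_marker in response.text,
    }
```

Calling summarize_login(login_request) right after the POST gives you the status code, the URL you ended up on, and whether the login form came back.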

Edit:

Some sites will redirect you several times when you make a request. Make sure to check the history attribute of the response to see what happened and why you got bounced out. For example, I get redirects like this all of the time:

>>> some_request.history
(<Response [302]>, <Response [302]>)

Each item in history is the response for one redirect hop. You can inspect them like normal response objects, for example some_request.history[0].url, and you can disable redirects by passing allow_redirects=False to the request call:

login_request = requests.post(login_url, data=payload, headers=post_headers, allow_redirects=False)
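To see where a login bounced you, something like the following can list every hop. It only reads status_code, url and the Location header, which requests exposes on each response; the function name is made up for this example.

```python
def redirect_chain(response):
    """List (status_code, url, location) for each redirect hop, then the final page."""
    hops = [
        (r.status_code, r.url, r.headers.get("Location"))
        for r in response.history
    ]
    # The response itself is the last stop, so it has no Location to follow.
    hops.append((response.status_code, response.url, None))
    return hops
```

Printing redirect_chain(login_request) shows each intermediate URL, which helps spot a hop that sends you back to the login page.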

In some cases, I've had to disallow redirects and add new cookies before progressing to the proper page. Try using something like this to keep your existing cookies and add the new cookies to it:

# update() merges in place and works on both Python 2 and 3
cookie_dict.update(new_request.cookies.get_dict())

Doing this after each request will keep your cookies up-to-date for your next request, similar to how your browser would.
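requests can also do this bookkeeping for you: a requests.Session stores the cookies from every response, including redirect hops, and sends them on later requests. Here is a minimal sketch, assuming the form's /do/login action lives on https://mwomercs.com and that the other stats pages differ only in the type query parameter (only type=mech appears in the question, so any other value is a placeholder you would need to look up). The function names are hypothetical.

```python
BASE_URL = "https://mwomercs.com"  # assumed host for the form's action="/do/login"


def new_session():
    """Create a cookie-persisting session (needs the third-party requests package)."""
    import requests
    return requests.Session()


def login(session, email, password, return_path="/profile/stats?type=mech"):
    """POST the fields named in the login form; cookies stay on the session."""
    payload = {"email": email, "password": password, "return": return_path}
    response = session.post(BASE_URL + "/do/login", data=payload)
    response.raise_for_status()  # only catches HTTP errors, not a rejected password
    return response


def fetch_stats(session, stat_types):
    """Fetch one stats page per type, reusing the logged-in session's cookies."""
    pages = {}
    for stat_type in stat_types:
        response = session.get(BASE_URL + "/profile/stats", params={"type": stat_type})
        response.raise_for_status()
        pages[stat_type] = response.text
    return pages
```

Typical use would be session = new_session(), then login(session, email, password), then pages = fetch_stats(session, ['mech']). A failed login that still returns a 200 page will not raise here, so it is still worth checking the page content as described earlier.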

Upvotes: 1
