nutship

Reputation: 4924

Scraping a website that requires being logged in

I'm trying to scrape a website with BeautifulSoup. The site in question requires me to be logged in. Please have a look at my code.

from bs4 import BeautifulSoup as bs
import requests
import sys

user = 'user'
password = 'pass'

# Url to login page
url = 'main url'

# Starts a session
session = requests.session(config={'verbose': sys.stderr})

login_data = {
'loginuser': user,
'loginpswd': password,
'submit': 'login',
}

r = session.post(url, data=login_data)

# Accessing a page to scrape
r = session.get('specific url')
soup = bs(r.content)

I came up with this code after having seen some threads here at SO, so I guess it should be valid, but the content printed is still as if I were logged out.

When I run this code, this is printed:

2013-05-10T22:49:45.882000   POST   >the main url to login<
2013-05-10T22:49:46.676000   GET    >error page of the main url page as if the logging in failed<
2013-05-10T22:49:46.761000   GET    >the specific url<

Of course, the login details are correct. I need some help, guys.

@EDIT

How would I implement headers into the above?

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
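For comparison, here is a minimal sketch of the same idea using `requests` instead of `urllib2`: a header set once on the `Session` object is sent with every request made through that session (the `Mozilla/5.0` value is just the one from the snippet above).

```python
import requests

# Set a default User-Agent once on the session; it will be included
# in every request made through this session.
session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0'

# Per-request headers can still be merged in via the headers= argument,
# e.g. session.get(url, headers={'X-Extra': 'value'})
```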

Upvotes: 1

Views: 2014

Answers (2)

Ian Stapleton Cordasco

Reputation: 28747

First of all you should not be using any version of requests older than 1.2.0. We simply won't support them if you find bugs (which you might).

Second, what you're likely looking for is this:

import sys

import requests
from requests.packages.urllib3 import add_stderr_logger

add_stderr_logger()
s = requests.Session()

s.headers['User-Agent'] = 'Mozilla/5.0'

# after examining the HTML of the website you're trying to log into
# set name_form to the name of the form element that contains the name and
# set password_form to the name of the form element that will contain the password
login = {name_form: username, password_form: password}
login_response = s.post(url, data=login)
for r in login_response.history:
    if r.status_code == 401:  # 401 means authentication failed
        sys.exit(1)  # abort

pdf_response = s.get(pdf_url)  # Your cookies and headers are automatically included

I commented the code to help you. You can also try @FastTurtle's suggestion of using HTTP Basic Auth, but if you're trying to post to a form in the first place, you can continue doing it the way I described above. Also make sure that loginuser and loginpswd are the correct form element names; if they're not, that could be the issue here.
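One quick way to verify the form element names is to parse the login page with BeautifulSoup and list every `<input>` name inside the form. The HTML string below is only a stand-in for the real login page; in practice you would feed `r.content` from a GET of the login URL into BeautifulSoup instead.

```python
from bs4 import BeautifulSoup

# Stand-in for the real login page's HTML; replace with the content
# fetched from the actual login URL.
html = '''
<form action="/login" method="post">
  <input type="text" name="loginuser">
  <input type="password" name="loginpswd">
  <input type="submit" name="submit" value="login">
</form>
'''

soup = BeautifulSoup(html, 'html.parser')
# These are the keys your login_data dict must use.
field_names = [inp.get('name') for inp in soup.find_all('input')]
print(field_names)
```

If the names printed here don't match the keys in your `login_data` dict, the server will ignore your credentials and serve the logged-out page.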

Upvotes: 3

FastTurtle

Reputation: 2311

The requests module has support for several types of authentication. With any luck the website you are trying to parse uses HTTP Basic Auth, in which case it's pretty easy to send credentials.

This example is taken from the requests website. You can read more in the requests documentation, in the sections on authentication and custom headers.

s = requests.Session()
s.auth = ('user', 'pass')
s.headers.update({'x-test': 'true'})

# both 'x-test' and 'x-test2' are sent
s.get('http://httpbin.org/headers', headers={'x-test2': 'true'})
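Basic Auth can also be passed per request rather than on the session. As a sketch of what requests does under the hood, you can prepare a request and inspect the `Authorization` header it generates from the credentials (the httpbin URL is just a placeholder target):

```python
import requests
from requests.auth import HTTPBasicAuth

# requests encodes 'user:pass' into a base64 Authorization header for you.
auth = HTTPBasicAuth('user', 'pass')
req = requests.Request('GET', 'http://httpbin.org/basic-auth/user/pass', auth=auth)
prepared = req.prepare()
print(prepared.headers['Authorization'])  # Basic dXNlcjpwYXNz
```

If the site uses a login form rather than HTTP Basic Auth, this header will have no effect, which is why checking the form element names (as in the other answer) matters first.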

Upvotes: 1
