nutship

Reputation: 4924

Scraping a website that requires being logged in

I'm trying to scrape a website with BeautifulSoup. The site in question requires me to be logged in. Please have a look at my code.

from bs4 import BeautifulSoup as bs
import requests
import sys

user = 'user'
password = 'pass'

# Url to login page
url = 'main url'

# Starts a session
session = requests.session(config={'verbose': sys.stderr})

login_data = {
'loginuser': user,
'loginpswd': password,
'submit': 'login',
}

r = session.post(url, data=login_data)

# Accessing a page to scrape
r = session.get('specific url')
soup = bs(r.content)

I came up with this code after having seen some threads here at SO, so I guess it should be valid, but the content printed is still as if I were logged out.

When I run this code, this is printed:

2013-05-10T22:49:45.882000   POST   >the main url to login<
2013-05-10T22:49:46.676000   GET    >error page of the main url page as if the logging in failed<
2013-05-10T22:49:46.761000   GET    >the specific url<

Of course, the login details are correct. I need some help, guys.

@EDIT

How would I implement headers into the above?

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
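For comparison, here is a minimal sketch of the same idea using `requests` instead of `urllib2`: a header set once on the `Session` object is sent with every request made through that session (the `Mozilla/5.0` value is just the one from the snippet above).

```python
import requests

# Set a default User-Agent once on the session; it will be included
# in every request made through this session.
session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0'

# Per-request headers can still be merged in via the headers= argument,
# e.g. session.get(url, headers={'X-Extra': 'value'})
```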

Upvotes: 1

Views: 2014

Answers (2)

Ian Stapleton Cordasco

Reputation: 28747

First of all you should not be using any version of requests older than 1.2.0. We simply won't support them if you find bugs (which you might).

Second, what you're likely looking for is this:

import sys

import requests
from requests.packages.urllib3 import add_stderr_logger

add_stderr_logger()
s = requests.Session()

s.headers['User-Agent'] = 'Mozilla/5.0'

# after examining the HTML of the website you're trying to log into
# set name_form to the name of the form element that contains the name and
# set password_form to the name of the form element that will contain the password
login = {name_form: username, password_form: password}
login_response = s.post(url, data=login)
for r in login_response.history:
    if r.status_code == 401:  # 401 means authentication failed
        sys.exit(1)  # abort

pdf_response = s.get(pdf_url)  # Your cookies and headers are automatically included

I commented the code to help you. You can also try @FastTurtle's suggestion of using HTTP Basic Auth, but if you're trying to post to a form in the first place, you can continue doing it the way I described above. Also make sure that loginuser and loginpswd are the correct form element names; if they're not, that could be the issue here.
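One quick way to verify the form element names is to parse the login page with BeautifulSoup and list every `<input>` name inside the form. The HTML string below is only a stand-in for the real login page; in practice you would feed `r.content` from a GET of the login URL into BeautifulSoup instead.

```python
from bs4 import BeautifulSoup

# Stand-in for the real login page's HTML; replace with the content
# fetched from the actual login URL.
html = '''
<form action="/login" method="post">
  <input type="text" name="loginuser">
  <input type="password" name="loginpswd">
  <input type="submit" name="submit" value="login">
</form>
'''

soup = BeautifulSoup(html, 'html.parser')
# These are the keys your login_data dict must use.
field_names = [inp.get('name') for inp in soup.find_all('input')]
print(field_names)
```

If the names printed here don't match the keys in your `login_data` dict, the server will ignore your credentials and serve the logged-out page.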

Upvotes: 3

FastTurtle

Reputation: 2311

The requests module has support for several types of authentication. With any luck the website you are trying to parse uses HTTP Basic Auth, in which case it's pretty easy to send credentials.

This example is taken from the requests website. You can read more in the requests documentation, in the sections on authentication and custom headers.

s = requests.Session()
s.auth = ('user', 'pass')
s.headers.update({'x-test': 'true'})

# both 'x-test' and 'x-test2' are sent
s.get('http://httpbin.org/headers', headers={'x-test2': 'true'})
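Basic Auth can also be passed per request rather than on the session. As a sketch of what requests does under the hood, you can prepare a request and inspect the `Authorization` header it generates from the credentials (the httpbin URL is just a placeholder target):

```python
import requests
from requests.auth import HTTPBasicAuth

# requests encodes 'user:pass' into a base64 Authorization header for you.
auth = HTTPBasicAuth('user', 'pass')
req = requests.Request('GET', 'http://httpbin.org/basic-auth/user/pass', auth=auth)
prepared = req.prepare()
print(prepared.headers['Authorization'])  # Basic dXNlcjpwYXNz
```

If the site uses a login form rather than HTTP Basic Auth, this header will have no effect, which is why checking the form element names (as in the other answer) matters first.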

Upvotes: 1
