Reputation: 4924
I'm trying to scrape a website with BeautifulSoup. The site in question requires me to be logged in. Please have a look at my code.
from bs4 import BeautifulSoup as bs
import requests
import sys
user = 'user'
password = 'pass'
# Url to login page
url = 'main url'
# Starts a session
session = requests.session(config={'verbose': sys.stderr})
login_data = {
    'loginuser': user,
    'loginpswd': password,
    'submit': 'login',
}
r = session.post(url, data=login_data)
# Accessing a page to scrape
r = session.get('specific url')
soup = bs(r.content)
I came up with this code after reading some threads here on SO, so I think it should be valid, but the content printed is still as if I were logged out.
When I run this code, this is printed:
2013-05-10T22:49:45.882000 POST >the main url to login<
2013-05-10T22:49:46.676000 GET >error page of the main url page as if the logging in failed<
2013-05-10T22:49:46.761000 GET >the specific url<
Of course, the login details are correct. I need some help, guys.
@EDIT
How would I add headers to the code above?
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
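For comparison, with requests the same User-Agent header can be attached to the session itself rather than through a urllib2 opener. A minimal sketch, assuming requests 1.x or newer and reusing the placeholder URL and form fields from above:
import requests

session = requests.Session()
# Headers set on the session are sent with every request made through it
session.headers.update({'User-Agent': 'Mozilla/5.0'})

login_data = {'loginuser': 'user', 'loginpswd': 'pass', 'submit': 'login'}
r = session.post('main url', data=login_data)
# Extra per-request headers can also be passed; they are merged with the session's
r = session.get('specific url', headers={'Accept': 'text/html'})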
Upvotes: 1
Views: 2014
Reputation: 28747
First of all, you should not be using any version of requests older than 1.2.0. We simply won't support older versions if you find bugs (which you might).
Second, what you're likely looking for is this:
import sys
import requests
from requests.packages.urllib3 import add_stderr_logger
add_stderr_logger()
s = requests.Session()
s.headers['User-Agent'] = 'Mozilla/5.0'
# after examining the HTML of the website you're trying to log into
# set name_form to the name of the form element that contains the name and
# set password_form to the name of the form element that will contain the password
login = {name_form: username, password_form: password}
login_response = s.post(url, data=login)
for r in login_response.history:
    if r.status_code == 401:  # 401 means authentication failed
        sys.exit(1)  # abort
pdf_response = s.get(pdf_url) # Your cookies and headers are automatically included
I commented the code to help you. You can also try @FastTurtle's suggestion of using HTTP Basic Auth, but if you're posting to a form in the first place, you can keep doing it the way I described above. Also make sure that loginuser and loginpswd are the correct form element names; if they're not, that could be the issue here.
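If it's unclear what those names actually are, the login page itself can be inspected. A small sketch, assuming BeautifulSoup is available as in the question and using the question's placeholder URL, that prints every form field on the login page:
from bs4 import BeautifulSoup
import requests

s = requests.Session()
# Fetch the login page and list its forms and input names;
# the real field names are what belong in the POST payload
html = s.get('main url').text
soup = BeautifulSoup(html)
for form in soup.find_all('form'):
    print(form.get('action'))
    for field in form.find_all('input'):
        print('%s (%s)' % (field.get('name'), field.get('type')))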
Upvotes: 3
Reputation: 2311
The requests module has support for several types of authentication. With any luck, the website you are trying to parse uses HTTP Basic Auth, in which case it's pretty easy to send credentials.
This example is taken from the requests website; its documentation has more on authentication and on headers.
s = requests.Session()
s.auth = ('user', 'pass')
s.headers.update({'x-test': 'true'})
# both 'x-test' and 'x-test2' are sent
s.get('http://httpbin.org/headers', headers={'x-test2': 'true'})
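For a one-off request rather than a whole session, the credentials can also be passed per call. A short sketch against httpbin's basic-auth test endpoint (which simply reports whether the credentials were accepted):
import requests
from requests.auth import HTTPBasicAuth

# httpbin's /basic-auth/user/pass endpoint expects exactly these credentials
r = requests.get('http://httpbin.org/basic-auth/user/pass',
                 auth=HTTPBasicAuth('user', 'pass'))
print(r.status_code)  # 200 if authentication succeeded, 401 otherwise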
Upvotes: 1