Reputation: 169
I checked this question but it only has one answer and it's a little over my head (just started with Python). I'm using Python 3.
I'm trying to scrape data from this page, but if you have a BP account, the page is a lot different/more useful. I need my program to log me in before I have BeautifulSoup get the data for me.
So far I have
from bs4 import BeautifulSoup
import urllib.request
import requests
username = 'myUsername'
password = 'myPassword'
from requests import session
payload = {'action': 'Log in',
'Username: ': username,
'Password: ': password}
# the next 3 lines are pretty much copied from a different StackOverflow
# question. I don't really understand what they're doing, and obviously these
# are where the problem is.
with session() as c:
c.post('https://www.baseballprospectus.com/manageprofile.php', data=payload)
response = c.get('http://www.baseballprospectus.com/sortable/index.php?cid=1820315')
soup = BeautifulSoup(response.content, "lxml")
for row in soup.find_all('tr')[7:]:
cells = row.find_all('td')
name = cells[1].text
print(name)
The script does work, it just pulls the data from the site before it's logged in, so its not the data I want.
Upvotes: 1
Views: 1375
Reputation: 377
Conceptually, there is no problem with your code. You're using a session object to send a login request, then with the same session you're sending a request for the desired page. This means that the cookies set by the login request should be kept for the second request. If you want to read more about the workings of the Session object, here's the relevant Requests documentation.
Since I don't have a valid login for Baseball Prospectus, I'll have to guess that something is wrong with the data you're sending to the login page. A quick inspection using the 'Network' tab in Chrome's Developer Tools, shows that the login page, manageprofile.php, accepts four POST parameters:
username: myUsername
password: myPassword
action: muffinklezmer
nocache: some long number, e.g. 2417395155
However you're sending a different set of parameters, and specifying a different value for the 'action' parameter. Note that the parameter names have to match the original request exactly, otherwise manageprofile.php will not accept the login.
Try replacing the payload dictionary with this version:
payload = {
'action': 'muffinklezmer',
'username': username,
'password': password}
If this doesn't work, try adding the 'nocache' parameter too, e.g.:
'nocache': '1437955145'
Upvotes: 2