jbf

Reputation: 169

Scraping data from a webpage with Python 3, need to log in first

I checked this question but it only has one answer and it's a little over my head (just started with Python). I'm using Python 3.

I'm trying to scrape data from this page, but if you have a BP account, the page is a lot different/more useful. I need my program to log me in before I have BeautifulSoup get the data for me.

So far I have

from bs4 import BeautifulSoup
from requests import session

username = 'myUsername'
password = 'myPassword'

payload = {'action': 'Log in',
           'Username: ': username,
           'Password: ': password}

# the next 3 lines are pretty much copied from a different StackOverflow
# question. I don't really understand what they're doing, and obviously these 
# are where the problem is.

with session() as c:
    c.post('https://www.baseballprospectus.com/manageprofile.php', data=payload)
    response = c.get('http://www.baseballprospectus.com/sortable/index.php?cid=1820315')

soup = BeautifulSoup(response.content, "lxml")

for row in soup.find_all('tr')[7:]:
    cells = row.find_all('td')
    name = cells[1].text
    print(name)

The script runs, but it pulls the data from the site before it's logged in, so it's not the data I want.

Upvotes: 1

Views: 1375

Answers (1)

matangover

Reputation: 377

Conceptually, there is no problem with your code. You're using a session object to send a login request, then with the same session you're sending a request for the desired page. This means that the cookies set by the login request should be kept for the second request. If you want to read more about the workings of the Session object, here's the relevant Requests documentation.
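To make the cookie behaviour concrete, here is a minimal sketch that needs no network access. The cookie name and value and the example.com URL are made up for illustration; a real login response would set its own cookies. It stores a cookie in the session's jar by hand, then builds (without sending) a follow-up request through the same session to show that the cookie is attached:

```python
import requests

with requests.Session() as s:
    # Stand-in for what a successful login response would do:
    # put a cookie into the session's cookie jar.
    # "sessionid" / "abc123" are hypothetical values.
    s.cookies.set("sessionid", "abc123")

    # Build, but do not send, a later request through the same session.
    request = requests.Request("GET", "http://example.com/private")
    prepared = s.prepare_request(request)

    # The session has copied its stored cookies into the Cookie header.
    cookie_header = prepared.headers.get("Cookie")

print(cookie_header)
```

If the login POST never succeeds, the jar stays empty and the second request goes out without a session cookie, which matches the symptom in the question.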

Since I don't have a valid login for Baseball Prospectus, I'll have to guess that something is wrong with the data you're sending to the login page. A quick inspection using the 'Network' tab in Chrome's Developer Tools shows that the login page, manageprofile.php, accepts four POST parameters:

username: myUsername
password: myPassword
action: muffinklezmer
nocache: some long number, e.g. 2417395155

However, you're sending a different set of parameters and a different value for the 'action' parameter. Note that the parameter names have to match the original request exactly; otherwise manageprofile.php will not accept the login.
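One way to see the mismatch is to compare the field names in the question's payload with the ones observed in the browser. The sketch below uses only the standard library; urlencode gives a rough picture of the form-encoded body that a dict passed as data= turns into. The variable names are my own:

```python
from urllib.parse import urlencode

# The payload from the question (note the ': ' inside the key names).
original_payload = {
    "action": "Log in",
    "Username: ": "myUsername",
    "Password: ": "myPassword",
}

# Field names seen in the browser's Network tab for the real login request.
observed_fields = {"username", "password", "action", "nocache"}

# Roughly what would be sent as the POST body.
encoded_body = urlencode(original_payload)
print(encoded_body)

# Keys the server was never shown to expect, and expected keys that are absent.
unexpected = sorted(set(original_payload) - observed_fields)
missing = sorted(observed_fields - set(original_payload))
print("unexpected:", unexpected)
print("missing:", missing)
```

The keys 'Username: ' and 'Password: ' look like the labels on the login form rather than the form's field names, and the server only sees the field names.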

Try replacing the payload dictionary with this version:

payload = {
    'action': 'muffinklezmer',
    'username': username,
    'password': password,
}

If this doesn't work, try adding the 'nocache' parameter too, e.g.:

'nocache': '1437955145'
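Putting the pieces together, here is a rough sketch of the whole flow. The function names build_login_payload and fetch_after_login are mine, not part of Requests, and since I can't test against the real site, treat this as a starting point. The session argument is expected to be a requests.Session (anything with compatible post and get methods will do):

```python
def build_login_payload(username, password, nocache=None):
    """Return the form fields observed in the browser's login request."""
    payload = {
        "action": "muffinklezmer",
        "username": username,
        "password": password,
    }
    if nocache is not None:
        # Only sent if the caller supplies a value; what the site expects
        # here is a guess based on the observed request.
        payload["nocache"] = str(nocache)
    return payload


def fetch_after_login(session, login_url, page_url, payload):
    """POST the login form with `session`, then GET `page_url` with the same session."""
    login_response = session.post(login_url, data=payload)
    # Raises on HTTP 4xx/5xx. A rejected login may still return 200,
    # so this does not prove the login itself succeeded.
    login_response.raise_for_status()

    page_response = session.get(page_url)
    page_response.raise_for_status()
    return page_response
```

You would call it inside your existing with session() as c: block, passing c together with the two URLs from your question, and hand the returned response's content to BeautifulSoup as before. To confirm the login actually worked, check the fetched page for something that only appears when logged in.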

Upvotes: 2
