Michael T

Reputation: 1955

Python POST Request Failing, [Errno 10054] An existing connection was forcibly closed by the remote host

I'm using Beautiful Soup to try to scrape a web page. The code worked great, but now it's failing. I think the problem is that the source site changed its login page, so I replaced loginurl, but the script apparently can't connect to that URL even though I can reach it directly in a browser. Can someone run this and tell me what I'm doing wrong?

import requests
from bs4 import BeautifulSoup
import re
import pymysql
import datetime

myurl = 'http://www.cbssports.com'

loginurl = 'https://auth.cbssports.com/login/index'

try:
    response = requests.get(loginurl)
except requests.exceptions.ConnectionError as e:
    print("BAD DOMAIN")

payload = {
    'dummy::login_form': 1,
    'form::login_form': 'login_form',
    'xurl': myurl,
    'master_product': 150,
    'vendor': 'cbssports',
    'userid': 'myuserid',
    'password': 'mypassword',
    '_submit': 'Sign in',
}

session = requests.Session()
p = session.post(loginurl, data=payload)

#(code to scrape the web page)

I get the following error: requests.exceptions.ConnectionError: HTTPSConnectionPool(host='auth.cbssports.com', port=443): Max retries exceeded with url: /login (Caused by : [Errno 10054] An existing connection was forcibly closed by the remote host)

Is the website actively blocking my automated login? Or do I have something wrong in the data payload?
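One way to test the blocking hypothesis is to send browser-like headers on the session before posting. This is only a diagnostic sketch: the `User-Agent` string is an example, and it is an assumption (not confirmed for cbssports.com) that the server resets connections from clients that identify as `python-requests`.

```python
import requests

# Browser-like headers; some servers reset connections from clients
# whose default User-Agent looks like an automated tool. This is an
# assumption to test, not a confirmed cause for this site.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html,application/xhtml+xml",
}

session = requests.Session()
session.headers.update(headers)  # applied to every request on this session

# response = session.get("https://auth.cbssports.com/login/index")
```

If the GET succeeds with these headers but fails without them, that would point at header-based filtering rather than a payload problem.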

Edit: Here's a simpler piece of code...

import requests

myurl = 'http://www.cbssports.com'

loginurl = 'https://auth.cbssports.com/login/index'

try:
    response = requests.get(myurl)
except requests.exceptions.ConnectionError as e:
    print("My URL is BAD")

try:
    response = requests.get(loginurl)
except requests.exceptions.ConnectionError as e:
    print("Login URL is BAD")

Note that the login URL fails but the main one does not. I can access both URLs manually in a browser, so why is the login page unreachable from Python?

Upvotes: 1

Views: 4908

Answers (2)

Michael T

Reputation: 1955

OK, I'm not sure why this worked, but I solved this by simply changing the https to http in the login address, and like magic it worked. It appears CBS may serve an insecure version of the same page.
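The scheme swap can be done without hand-editing the string; a minimal sketch using only the standard library (the function name is mine, not from the original code). Note that posting credentials over plain http sends them unencrypted, so treat this as a diagnostic workaround rather than a fix.

```python
from urllib.parse import urlparse, urlunparse

def force_http(url):
    """Rewrite a URL's scheme to plain http.

    Warning: any credentials sent over the resulting URL travel
    unencrypted; use only to diagnose TLS-related connection resets.
    """
    parts = urlparse(url)
    return urlunparse(parts._replace(scheme="http"))

loginurl = force_http("https://auth.cbssports.com/login/index")
# loginurl is now "http://auth.cbssports.com/login/index"
```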

Upvotes: 1

mattdennewitz

Reputation: 76

short answer: add a scheme (http://) to myurl (from www.cbssports.com to http://www.cbssports.com) before using it as the xurl post value.


longer answer: your session authentication and request code is fine. i believe the issue is that cbs's app is confused by your value for xurl (the parameter cbs reads to decide where to redirect a user after successful authentication). you're passing in a scheme-less url, www.cbssports.com, which cbs is interpreting as a relative path - there is no http://cbssports.com/www.cbssports.com, so it (correctly, but confusingly) 404s. adding a scheme to make this an absolute url fixes the issue, giving you an authenticated session for all subsequent requests. huzzah!
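A small helper along these lines can guard the xurl value before building the payload. This is a sketch, not part of the original code; the function name `ensure_scheme` is mine.

```python
from urllib.parse import urlparse

def ensure_scheme(url, default="http"):
    # Prepend a scheme if the URL has none, so the server treats it
    # as an absolute URL instead of a relative path.
    if not urlparse(url).scheme:
        return f"{default}://{url}"
    return url

myurl = ensure_scheme("www.cbssports.com")
# myurl is now "http://www.cbssports.com", safe to use as the xurl value
```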

however, i could not reproduce the ConnectionError you experienced, which makes me wonder if that was network congestion rather than anti-scraping measures on cbs's side.

hope this is helpful.

Upvotes: 1
