New Programmer

Reputation: 21

Error message 10054 when webscraping with requests module

New programmer here. I tried to webscrape the POF site while learning Python, using the requests and Beautiful Soup modules. Thanks in advance.

The error seems to come from the line res=requests.get('https://www.pof.com/everyoneonline.aspx?page_id=%s' %pageId)

I tried to remove the pagination and scrape only one page, but it didn't work. I also tried time.sleep with 3 seconds between each request, but that didn't work either.

# Username and password
username = 'MyUsername'
password = 'MyPassword'


# Log in to the pof site
from selenium import webdriver
import bs4, requests
browser = webdriver.Chrome(executable_path='/Users/Desktop/geckodriver-v0.24.0-win32/chromedriver.exe')
browser.get('https://www.pof.com')
linkElem = browser.find_element_by_link_text('Sign In')
linkElem.click()
usernameElem = browser.find_element_by_id('logincontrol_username')
usernameElem.send_keys(username)
passwordElem = browser.find_element_by_id('logincontrol_password')
passwordElem.send_keys(password)
passwordElem.submit()

# Webscraping online profile links from the first 7 pagination pages
for pageId in range(7):
    res = requests.get('https://www.pof.com/everyoneonline.aspx?page_id=%s' % pageId)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    profile = soup.findAll('div', attrs={'class': 'rc'})
    for div in profile:
        # findAll returns a list, so iterate over it rather than indexing with 'href'
        for link in div.findAll('a'):
            print(link['href'])

Expected result: Printing a list of all href links of profile, so I can later save them to a csv

Actual result: requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))

Upvotes: 2

Views: 4142

Answers (1)

Xosrov

Reputation: 729

I'm gonna give you some general info for scraping webpages:

  1. First of all, don't use requests and selenium together! In my experience, requests is the fastest and easiest solution 90% of the time.
  2. Always try to provide headers with your request. Providing no headers makes the webpage suspicious, and it might even block all your requests (the error you're getting could be because of this!).
  3. For subsequent requests to the webpage, use a session! This way your cookies get stored and you can actually access the logged-in page for a long period of time (a minimal sketch follows this list).
  4. This one is more subjective, but I suggest using the re module if you already know regex. BeautifulSoup is great, but for general-purpose use, re is just easier in my experience.
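
A minimal sketch of points 2 and 3 (the URL and header values here are placeholders, not taken from your code):

import requests

headers = {
    # browser-like headers; copy the real values from your own browser
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    "accept-language": "en-US,en;q=0.9",
}

with requests.Session() as session:
    session.headers.update(headers)  # sent with every request made through the session
    res = session.get("https://example.com/login")  # placeholder URL
    res.raise_for_status()
    # any cookies the server sets are now stored on the session,
    # so later requests through it stay logged in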

So now, to answer your question: there are a lot of different webpages out there, but this is how I suggest scraping any of them.

Extracting Data


Header data

  • Open up your usual browser with inspect-element support. Go to the webpage you're trying to scrape and open the inspect element dock.
  • Go to the Network section. Here you can see all the requests your browser makes, along with their headers and sources.
  • Make the request you want to emulate, keep track of the Network tab, and go to the request that contains the desired GET (or, in your case, POST) method.
  • Copy the request headers for that particular request. You don't need all of them (for example, the cookie parameter will be added by the session, so it's not needed here; headers starting with a colon, like :method: POST, aren't needed either).
  • Put the copied headers from your browser into a Python dict; here's an example from this very webpage:
headers = {
    "accept": "application/json, text/javascript, */*; q=0.01",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6",
    "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
    "dnt": "1",
    "origin": "https://stackoverflow.com",
    "referer": "https://stackoverflow.com/questions/56399462/error-message-10054-when-wescraping-with-requests-module",
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/74.0.3729.169 Chrome/74.0.3729.169 Safari/537.36",
}

Post data

  • If you want to make a POST request, there should be another section in the Headers tab of the request, named something like "Payload" or "Form Data". Put its contents in another Python dict, and change its contents as desired (a short example follows).
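
For instance, if the Form Data panel showed fields named user_name and password (hypothetical names; the real field names depend entirely on the site), the dict would look like this:

data = {
    "user_name": "something",
    "password": "somethingelse",
}

The full login example below shows a dict like this being passed to session.post.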

Using Data


Now you're ready to put the extracted data to use with Python requests, then use re or BeautifulSoup on the response contents to extract your desired data.
In this example, I'm logging in to https://aavtrain.com/index.asp.
Try to follow the steps I've written and make sense of what's happening here:

import requests
username = "something"
password = "somethingelse"
headers = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    "accept-encoding": "gzip, deflate, br",
    "cache-control": "max-age=0",
    "content-type": "application/x-www-form-urlencoded",
    "dnt": "1",
    "origin": "https://aavtrain.com",
    "referer": "https://aavtrain.com/index.asp",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/74.0.3729.169 Chrome/74.0.3729.169 Safari/537.36"
}
data = {
    "user_name": username,
    "password": password,
    "Submit": "Submit",
    "login": "true"
}
with requests.Session() as session:
    # the initial GET stores the site's cookies on the session
    session.get("https://aavtrain.com/index.asp")
    # the POST submits the login form; the session sends the stored cookies along
    loggedIn = session.post("https://aavtrain.com/index.asp", headers=headers, data=data)
    # ... do stuff after logged in ...
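
As a follow-up sketch, once logged in you can run re (or BeautifulSoup) over the response body. The pattern below is only illustrative; the real markup of the page may differ:

import re

# pull every href attribute out of the logged-in page
# (loggedIn is the response from the session block above)
links = re.findall(r'href="([^"]+)"', loggedIn.text)
print(links)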

I hope this helps; ask any lingering questions and I'll get back to you.

Upvotes: 1
