Reputation: 21
New programmer here. I was learning Python and tried to web-scrape the POF site using the requests and Beautiful Soup modules. Thanks in advance.
The error seems to come from the line res=requests.get('https://www.pof.com/everyoneonline.aspx?page_id=%s' %pageId)
I tried removing the pagination and scraping only a single page, but it didn't work. I also tried a time.sleep of 3 seconds between each request, but that didn't work either.
#Username and password
username='MyUsername'
password='MyPassword'
#Login to pof site
from selenium import webdriver
import bs4,requests
browser = webdriver.Chrome(executable_path='/Users/Desktop/geckodriver-v0.24.0-win32/chromedriver.exe')
browser.get('https://www.pof.com')
linkElem= browser.find_element_by_link_text('Sign In')
linkElem.click()
usernameElem=browser.find_element_by_id('logincontrol_username')
usernameElem.send_keys(username)
passwordElem=browser.find_element_by_id('logincontrol_password')
passwordElem.send_keys(password)
passwordElem.submit()
#Webscraping online profile links from first 7 pagination pages
for pageId in range(7):
    res=requests.get('https://www.pof.com/everyoneonline.aspx?page_id=%s' %pageId)
    res.raise_for_status()
    soup= bs4.BeautifulSoup(res.text)
    profile = soup.findAll('div', attrs={'class' : 'rc'})
    for div in profile:
        print (div.findAll('a')['href'])
Expected result: Printing a list of all href links of profile, so I can later save them to a csv
Actual result:
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))
Upvotes: 2
Views: 4142
Reputation: 729
I'm gonna give you some general info when scraping webpages: consider using the re module if you already know regex. BeautifulSoup is great, but for general purpose uses, re is just easier in my experience.
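For example, here's a quick sketch on some made-up HTML (not from POF, just illustrative) comparing the two approaches:
import re
import bs4

html = '<div class="rc"><a href="/profile1">One</a> <a href="/profile2">Two</a></div>'

# regex: grab every href value straight out of the markup
links_re = re.findall(r'href="([^"]+)"', html)

# BeautifulSoup: same result, going through a real HTML parser
soup = bs4.BeautifulSoup(html, 'html.parser')
links_bs = [a['href'] for a in soup.find_all('a')]

print(links_re)  # ['/profile1', '/profile2']
print(links_bs)  # ['/profile1', '/profile2']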
So now to answer your question: there are a lot of different webpages out there, but this is how I suggest scraping from all of them:
Open your browser's developer tools and go to the Network section. Here you can see all the requests your browser is making, along with the headers and sources. Find the request you're interested in, made with the GET or, in your case, the POST method. Copy that request's headers into a python dict (the pseudo-headers starting with : like :method: POST are not needed):
headers = {
"accept": "application/json, text/javascript, */*; q=0.01",
"accept-encoding": "gzip, deflate, br",
"accept-language": "en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6",
"content-type": "application/x-www-form-urlencoded; charset=UTF-8",
"dnt": "1",
"origin": "https://stackoverflow.com",
"referer": "https://stackoverflow.com/questions/56399462/error-message-10054-when-wescraping-with-requests-module",
"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/74.0.3729.169 Chrome/74.0.3729.169 Safari/537.36",
}
Below the Headers section of the request you'll find the data that was sent with it, named something like "Payload" or "Form Data". Put its contents in another python dict, and change its contents as desired.
Now you're ready to put the extracted data to use with python requests, then run re or BeautifulSoup on the response contents to extract your desired data.
In this example I'm logging in to https://aavtrain.com/index.asp
Try to follow the steps I've written and make sense of what's happening here:
import requests
username = "something"
password = "somethingelse"
headers = {
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
"accept-encoding": "gzip, deflate, br",
"cache-control": "max-age=0",
"content-type": "application/x-www-form-urlencoded",
"dnt": "1",
"origin": "https://aavtrain.com",
"referer": "https://aavtrain.com/index.asp",
"upgrade-insecure-requests": "1",
"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/74.0.3729.169 Chrome/74.0.3729.169 Safari/537.36"
}
data = {
"user_name": username,
"password": password,
"Submit": "Submit",
"login": "true"
}
with requests.Session() as session:
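    # initial GET so the session picks up any cookies the site sets before logging in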
    session.get("https://aavtrain.com/index.asp")
    loggedIn = session.post("https://aavtrain.com/index.asp", headers=headers, data=data)
    #... do stuff after logged in..
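For your site the idea is the same: log in with a requests.Session the way shown above (using POF's own form fields and the headers you copied from your Network tab), then reuse that session for the paginated pages instead of a bare requests.get. A rough sketch of that second half, assuming session is an already-logged-in requests.Session, headers is your copied dict, and the div class 'rc' from your own code is correct:
import bs4

for pageId in range(7):
    res = session.get('https://www.pof.com/everyoneonline.aspx?page_id=%s' % pageId, headers=headers)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    # collect every profile link inside the divs your code was targeting
    for div in soup.find_all('div', attrs={'class': 'rc'}):
        for a in div.find_all('a'):
            print(a.get('href'))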
I hope this helps, ask any lingering questions and I'll get back to you.
Upvotes: 1