Reputation: 33
I'm using a proxy service to cycle requests through different proxy IPs for web scraping. Do I need to build in functionality to end requests so as not to overload the web server I'm scraping?
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlencode
import concurrent.futures
list_of_urls = ['https://www.example']
NUM_RETRIES = 3
NUM_THREADS = 5
def scrape_url(url):
    params = {'api_key': 'API_KEY', 'url': url}
    # send request to scraperapi, and automatically retry failed requests
    for _ in range(NUM_RETRIES):
        try:
            response = requests.get('http://api.scraperapi.com/', params=urlencode(params))
            if response.status_code in [200, 404]:
                ## escape for loop if the API returns a successful response
                break
        except requests.exceptions.ConnectionError:
            response = ''

    ## parse data if 200 status code (successful response)
    if response.status_code == 200:
        ## do stuff
        pass

with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    executor.map(scrape_url, list_of_urls)
Upvotes: 0
Views: 396
Reputation: 128
Hi, if you are using a recent version of requests, it is most probably keeping the TCP connection alive. What you can do is create a requests session, configure it not to keep connections alive, and then proceed normally with your code:
s = requests.Session()
s.headers.update({'Connection': 'close'})
(The s.config['keep_alive'] = False setting you may see in older answers only existed in requests before 1.0; in current versions you disable keep-alive by sending a Connection: close header, as above.)
As discussed here, there really isn't such a thing as an HTTP connection; what httplib refers to as the HTTPConnection is really the underlying TCP connection, which doesn't know much about your requests at all. Requests abstracts that away and you won't ever see it.
The newest version of Requests does in fact keep the TCP connection alive after your request. If you do want your TCP connections to close, you can configure requests not to use keep-alive.
Alternatively, pass the header on a single request instead of configuring the whole session:
response = s.get(url, headers={'Connection': 'close'})
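If you would rather not manage headers by hand, another option (a minimal sketch, not from the code above) is to use the session as a context manager, so its connection pool is closed as soon as the block exits:
import requests

def fetch(url):
    # closing the session releases every pooled TCP connection it holds
    with requests.Session() as s:
        s.headers.update({'Connection': 'close'})  # also ask the server not to keep the socket open
        return s.get(url, timeout=10)
Session.close() is what actually releases the pooled connections; the context manager just calls it for you when the with block ends.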
Updated version of your code
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlencode
import concurrent.futures

list_of_urls = ['https://www.example']

NUM_RETRIES = 3
NUM_THREADS = 5

def scrape_url(url):
    params = {'api_key': 'API_KEY', 'url': url}
    # use a session with keep-alive disabled so the TCP connection
    # is closed after each request
    s = requests.Session()
    s.headers.update({'Connection': 'close'})
    response = None
    # send request to scraperapi, and automatically retry failed requests
    for _ in range(NUM_RETRIES):
        try:
            response = s.get('http://api.scraperapi.com/', params=urlencode(params))
            if response.status_code in [200, 404]:
                ## escape for loop if the API returns a successful response
                break
        except requests.exceptions.ConnectionError:
            response = None

    ## parse data if 200 status code (successful response)
    if response is not None and response.status_code == 200:
        ## do stuff
        pass

with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    executor.map(scrape_url, list_of_urls)
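As a usage note (a sketch, assuming you change scrape_url to return whatever it parses, e.g. response.text): executor.map yields the return values in the same order as list_of_urls, so you can collect the results directly:
results = []
with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    # map() yields scrape_url's return value for each URL, in input order;
    # a None entry would mean that URL never got a 200 response
    for url, data in zip(list_of_urls, executor.map(scrape_url, list_of_urls)):
        results.append((url, data))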
Upvotes: 1