Reputation: 22440
I've written a script in Python using multiprocessing.pool.ThreadPool
to handle multiple requests concurrently and make the scraping process robust. The parser is doing its job perfectly.
Since I've noticed in several scripts that there should be a delay within the scraping process when it is built with multiprocessing, I would like to put a delay in my script below as well.
However, this is where I'm stuck: I can't figure out the right position to put that delay.
This is my script so far:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from multiprocessing.pool import ThreadPool

url = "http://srar.com/roster/index.php?agent_search=a"

def get_links(link):
    completelinks = []
    res = requests.get(link)
    soup = BeautifulSoup(res.text, 'lxml')
    for items in soup.select("table.border tr"):
        if not items.select("td a[href^='index.php?agent']"):
            continue
        data = [urljoin(link, item.get("href")) for item in items.select("td a[href^='index.php?agent']")]
        completelinks.extend(data)
    return completelinks

def get_info(nlink):
    req = requests.get(nlink)
    sauce = BeautifulSoup(req.text, "lxml")
    for tr in sauce.select("table[style$='1px;'] tr")[1:]:
        table = [td.get_text(strip=True) for td in tr.select("td")]
        print(table)

if __name__ == '__main__':
    ThreadPool(20).map(get_info, get_links(url))
Once again: all I need to know is the right position within my script to put a delay.
Upvotes: 2
Views: 2067
Reputation: 21664
For putting a delay in between your multiple requests.get() calls, located within get_info, you would have to expand get_info with a delay argument, which it can pass into a time.sleep() call. Since all your worker threads start at once, your delays have to be cumulative for every call. That means, if you want the delay between the requests.get() calls to be 0.5 seconds, the list of delays you pass along into the pool method would look like [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, ...].
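As a minimal sketch of that idea (DELAY and links here are just placeholder names, not part of your original script), the cumulative delays can be built directly from each link's position:

DELAY = 0.5                                      # gap between consecutive requests
links = ['link0', 'link1', 'link2']              # stand-in for the scraped links
delays = [i * DELAY for i in range(len(links))]  # one delay per link
print(delays)                                    # [0.0, 0.5, 1.0]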
To avoid having to alter get_info itself, I'm using a decorator in the example below to extend get_info with a delay parameter and a time.sleep(delay) call. Note that I'm passing the delays along with the other argument for get_info in the pool.starmap call.
import time
import logging
from multiprocessing.pool import ThreadPool
from functools import wraps

def delayed(func):
    @wraps(func)
    def wrapper(delay, *args, **kwargs):
        time.sleep(delay)  # <--
        return func(*args, **kwargs)
    return wrapper

@delayed
def get_info(nlink):
    info = nlink + '_info'
    logger.info(msg=info)
    return info

def get_links(n):
    return [f'link{i}' for i in range(n)]

def init_logging(level=logging.DEBUG):
    fmt = '[%(asctime)s %(levelname)-8s %(threadName)s' \
          ' %(funcName)s()] --- %(message)s'
    logging.basicConfig(format=fmt, level=level)

if __name__ == '__main__':
    DELAY = 0.5
    init_logging()
    logger = logging.getLogger(__name__)

    links = get_links(10)  # ['link0', 'link1', 'link2', ...]
    delays = (x * DELAY for x in range(0, len(links)))
    arguments = zip(delays, links)  # (0.0, 'link0'), (0.5, 'link1'), ...

    with ThreadPool(10) as pool:
        result = pool.starmap(get_info, arguments)

    print(result)
Example Output:
[2018-10-03 22:04:14,221 INFO Thread-8 get_info()] --- link0_info
[2018-10-03 22:04:14,721 INFO Thread-5 get_info()] --- link1_info
[2018-10-03 22:04:15,221 INFO Thread-3 get_info()] --- link2_info
[2018-10-03 22:04:15,722 INFO Thread-4 get_info()] --- link3_info
[2018-10-03 22:04:16,223 INFO Thread-1 get_info()] --- link4_info
[2018-10-03 22:04:16,723 INFO Thread-6 get_info()] --- link5_info
[2018-10-03 22:04:17,224 INFO Thread-7 get_info()] --- link6_info
[2018-10-03 22:04:17,723 INFO Thread-10 get_info()] --- link7_info
[2018-10-03 22:04:18,225 INFO Thread-9 get_info()] --- link8_info
[2018-10-03 22:04:18,722 INFO Thread-2 get_info()] --- link9_info
['link0_info', 'link1_info', 'link2_info', 'link3_info', 'link4_info',
'link5_info', 'link6_info', 'link7_info', 'link8_info', 'link9_info']
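If you want to wire this back into your original scraper, it could look roughly like the sketch below. This is an untested adaptation under the same assumptions as above: it keeps your get_links and get_info as posted, only adds the decorator, and passes cumulative delays through pool.starmap instead of pool.map.

import time
from functools import wraps
from multiprocessing.pool import ThreadPool

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

DELAY = 0.5  # seconds between the start of consecutive requests

def delayed(func):
    @wraps(func)
    def wrapper(delay, *args, **kwargs):
        time.sleep(delay)              # wait before the real call
        return func(*args, **kwargs)
    return wrapper

url = "http://srar.com/roster/index.php?agent_search=a"

def get_links(link):
    completelinks = []
    res = requests.get(link)
    soup = BeautifulSoup(res.text, 'lxml')
    for items in soup.select("table.border tr"):
        if not items.select("td a[href^='index.php?agent']"):
            continue
        data = [urljoin(link, item.get("href")) for item in items.select("td a[href^='index.php?agent']")]
        completelinks.extend(data)
    return completelinks

@delayed
def get_info(nlink):
    req = requests.get(nlink)
    sauce = BeautifulSoup(req.text, "lxml")
    for tr in sauce.select("table[style$='1px;'] tr")[1:]:
        print([td.get_text(strip=True) for td in tr.select("td")])

if __name__ == '__main__':
    links = get_links(url)
    delays = (i * DELAY for i in range(len(links)))
    with ThreadPool(20) as pool:
        pool.starmap(get_info, zip(delays, links))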
Upvotes: 1