SIM

Reputation: 22440

How to delay execution within ThreadPool?

I've written a script in Python using multiprocessing.pool.ThreadPool to handle multiple requests concurrently and make the scraping process robust. The parser is doing its job perfectly.

I've noticed in several scripts that there should be a delay within the scraping process when it is built with multiprocessing, so I would like to add a delay to my script below as well.

However, this is where I'm stuck: I can't find the right position to put that delay.

This is my script so far:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from multiprocessing.pool import ThreadPool

url = "http://srar.com/roster/index.php?agent_search=a"

def get_links(link):
    # collect all agent detail-page links from the roster page
    completelinks = []
    res = requests.get(link)
    soup = BeautifulSoup(res.text,'lxml')
    for items in soup.select("table.border tr"):
        if not items.select("td a[href^='index.php?agent']"):continue
        data = [urljoin(link,item.get("href")) for item in items.select("td a[href^='index.php?agent']")]
        completelinks.extend(data)
    return completelinks

def get_info(nlink):
    # scrape and print the table rows from a single agent page
    req = requests.get(nlink)
    sauce = BeautifulSoup(req.text,"lxml")
    for tr in sauce.select("table[style$='1px;'] tr")[1:]:
        table = [td.get_text(strip=True) for td in tr.select("td")]
        print(table)

if __name__ == '__main__':
    ThreadPool(20).map(get_info, get_links(url))

Once again: all I need to know is the right position within my script to put a delay.

Upvotes: 2

Views: 2067

Answers (1)

Darkonaut

Reputation: 21664

To put a delay between your multiple requests.get() calls inside get_info, you would have to expand get_info with a delay argument, which it can pass to a time.sleep() call. Since all your worker threads start at once, the delays have to be cumulative for every call. That means if you want 0.5 seconds between the requests.get() calls, the list of delays you pass into the pool method would look like [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, ...].
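
For illustration, such a cumulative delay list can be built with a simple comprehension (the 0.5-second step and the placeholder links are just example values):

DELAY = 0.5  # desired gap between consecutive requests, in seconds
links = ['link0', 'link1', 'link2', 'link3']  # placeholder links
delays = [i * DELAY for i in range(len(links))]
print(delays)  # -> [0.0, 0.5, 1.0, 1.5]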

To avoid having to alter get_info itself, I'm using a decorator in the example below to extend get_info with a delay parameter and a time.sleep(delay) call. Note that I'm passing the delays along with the other argument for get_info in the pool.starmap call.

import time
import logging
from multiprocessing.pool import ThreadPool
from functools import wraps

def delayed(func):
    @wraps(func)
    def wrapper(delay, *args, **kwargs):
        time.sleep(delay)  # <-- wait before running the wrapped function
        return func(*args, **kwargs)
    return wrapper

@delayed
def get_info(nlink):
    info = nlink + '_info'
    logger.info(msg=info)
    return info


def get_links(n):
    return [f'link{i}' for i in range(n)]


def init_logging(level=logging.DEBUG):
    fmt = '[%(asctime)s %(levelname)-8s %(threadName)s' \
          ' %(funcName)s()] --- %(message)s'
    logging.basicConfig(format=fmt, level=level)


if __name__ == '__main__':

    DELAY = 0.5

    init_logging()
    logger = logging.getLogger(__name__)

    links = get_links(10) # ['link0', 'link1', 'link2', ...]
    delays = (x * DELAY for x in range(0, len(links)))
    arguments = zip(delays, links) # (0.0, 'link0'), (0.5, 'link1'), ...

    with ThreadPool(10) as pool:
        result = pool.starmap(get_info, arguments)
        print(result)

Example Output:

[2018-10-03 22:04:14,221 INFO     Thread-8 get_info()] --- link0_info
[2018-10-03 22:04:14,721 INFO     Thread-5 get_info()] --- link1_info
[2018-10-03 22:04:15,221 INFO     Thread-3 get_info()] --- link2_info
[2018-10-03 22:04:15,722 INFO     Thread-4 get_info()] --- link3_info
[2018-10-03 22:04:16,223 INFO     Thread-1 get_info()] --- link4_info
[2018-10-03 22:04:16,723 INFO     Thread-6 get_info()] --- link5_info
[2018-10-03 22:04:17,224 INFO     Thread-7 get_info()] --- link6_info
[2018-10-03 22:04:17,723 INFO     Thread-10 get_info()] --- link7_info
[2018-10-03 22:04:18,225 INFO     Thread-9 get_info()] --- link8_info
[2018-10-03 22:04:18,722 INFO     Thread-2 get_info()] --- link9_info
['link0_info', 'link1_info', 'link2_info', 'link3_info', 'link4_info', 
'link5_info', 'link6_info', 'link7_info', 'link8_info', 'link9_info']
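
Applied to your script, the whole thing could look roughly like the sketch below. It's only a sketch: url and get_links() are taken over unchanged from your question, and the 0.5-second step is just an example value.

import time
import requests
from functools import wraps
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from multiprocessing.pool import ThreadPool

url = "http://srar.com/roster/index.php?agent_search=a"

def delayed(func):
    @wraps(func)
    def wrapper(delay, *args, **kwargs):
        time.sleep(delay)  # wait before running the wrapped function
        return func(*args, **kwargs)
    return wrapper

def get_links(link):
    # collect all agent detail-page links from the roster page (unchanged)
    completelinks = []
    res = requests.get(link)
    soup = BeautifulSoup(res.text, 'lxml')
    for items in soup.select("table.border tr"):
        if not items.select("td a[href^='index.php?agent']"):continue
        data = [urljoin(link, item.get("href")) for item in items.select("td a[href^='index.php?agent']")]
        completelinks.extend(data)
    return completelinks

@delayed
def get_info(nlink):
    # unchanged apart from the decorator
    req = requests.get(nlink)
    sauce = BeautifulSoup(req.text, "lxml")
    for tr in sauce.select("table[style$='1px;'] tr")[1:]:
        table = [td.get_text(strip=True) for td in tr.select("td")]
        print(table)

if __name__ == '__main__':
    DELAY = 0.5  # seconds between consecutive requests
    links = get_links(url)
    delays = (i * DELAY for i in range(len(links)))
    with ThreadPool(20) as pool:
        pool.starmap(get_info, zip(delays, links))

Note that all 20 worker threads still start immediately; it's the staggered sleeps inside the wrapper that space the actual requests.get() calls out.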

Upvotes: 1
