redunicorn

Reputation: 49

What's the fastest way to check thousands of urls?

I need to check at least 20k URLs to see whether each one is up and save some data in a database.

I already know how to check whether a URL is online and how to save some data in the database, but without concurrency it will take ages to check them all, so what's the fastest way to check thousands of URLs?

I am following this tutorial: https://realpython.com/python-concurrency/ and it seems that the "CPU-Bound multiprocessing Version" is the fastest approach, but I want to know whether that really is the fastest way or whether there are better options.

Edit:

Based on the replies, I will update the post with a comparison of multiprocessing and multithreading.

Example 1: Print "Hello!" 40 times

Threading vs. multiprocessing with 8 cores (timing results not reproduced here): with 8 threads, threading comes out ahead.

Example 2, the problem posed in my question:

After several tests, threading becomes faster than multiprocessing once you use more than 12 threads. For example, to test 40 URLs, threading with 40 threads is about 50% faster than multiprocessing with 8 cores.
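
For reference, a minimal sketch of the kind of threaded check used for the comparison (check_url here is only a stand-in for the real "is this URL up?" check, and the requests library is assumed):

from concurrent.futures import ThreadPoolExecutor
import requests


def check_url(url):
    # Placeholder check: consider the URL "up" if a HEAD request succeeds.
    try:
        return url, requests.head(url, timeout=5).status_code < 400
    except requests.RequestException:
        return url, False


if __name__ == "__main__":
    urls = ["https://example.com"] * 40  # stand-in for the real URL list
    with ThreadPoolExecutor(max_workers=40) as executor:
        results = list(executor.map(check_url, urls))
    print(results)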

Thanks for your help

Upvotes: 1

Views: 1899

Answers (3)

Sam

Reputation: 533

I currently use multiprocessing with Queues; it works fast enough for what I use it for.

Similarly to Artiom's solution, I set the number of processes to 80 (currently), use the "workers" to pull the data and send it to the queues, and once they are finished, go through the returned results and handle them depending on the queue (roughly the pattern sketched below).
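
A rough sketch of that pattern, not the exact code; fetch_data is a placeholder for whatever check or download you already have, and the URL list is made up:

import multiprocessing as mp
import requests


def fetch_data(url):
    # Placeholder: report whether the URL answers a HEAD request.
    try:
        return requests.head(url, timeout=5).status_code < 400
    except requests.RequestException:
        return False


def worker(task_q, result_q):
    # Each worker pulls URLs from the task queue until it sees the sentinel.
    while True:
        url = task_q.get()
        if url is None:
            break
        result_q.put((url, fetch_data(url)))


if __name__ == "__main__":
    urls = ["https://example.com"] * 100  # stand-in for the real URL list
    task_q, result_q = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(task_q, result_q)) for _ in range(80)]
    for w in workers:
        w.start()
    for url in urls:
        task_q.put(url)
    for _ in workers:
        task_q.put(None)  # one sentinel per worker
    results = [result_q.get() for _ in range(len(urls))]  # drain before joining
    for w in workers:
        w.join()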

Upvotes: 0

CircArgs

Reputation: 650

I think you should use a Pool: pool docs

Based on some results here: mp vs threading SO

I would say always use multiprocessing. Perhaps if you expect your requests to take a long time to resolve, then the context-switching benefits of threads would overcome the brute force of multiprocessing.

Something like

import multiprocessing as mp

urls = ['google.com', 'yahoo.com']

if __name__ == "__main__":
    with mp.Pool(mp.cpu_count()) as pool:
        # fetch_data is your own "check the URL and save the data" function
        results = pool.map(fetch_data, urls)

Edit: to address comments about using a fixed number of subprocesses, the snippet above requests a pool size equal to your number of logical cores via mp.cpu_count().

Upvotes: 3

Artiom Kozyrev

Reputation: 3836

To say that multiprocessing is always the best choice is incorrect; multiprocessing is best only for heavy computations!

The best choice for actions which do not require heavy computations, but only IN/OUT (I/O) operations such as database requests or requests to a remote web app API, is the threading module. Threading can be faster than multiprocessing because multiprocessing needs to serialize data to send it to the child processes, while threads share the same memory.

Threading module

The typical approach in this case is to create an input queue.Queue, put the tasks (URLs, in your case) into it, and create several workers to take tasks from the Queue:

import threading as thr
from queue import Queue


def work(input_q):
    """the function take task from input_q and print or return with some code changes (if you want)"""
    while True:
        item = input_q.get()
        if item == "STOP":
            break

        # else do some work here
        print("some result")


if __name__ == "__main__":
    input_q = Queue()
    urls = [...]
    threads_number = 8
    workers = [thr.Thread(target=work, args=(input_q,),) for i in range(threads_number)]
    # start workers here
    for w in workers:
        w.start()

    # start delivering tasks to workers 
    for task in urls:
        input_q.put(task)

    # "poison pillow" for all workers to stop them:

    for i in range(threads_number):
        input_q.put("STOP")

    # join all workers to main thread here:

    for w in workers:
        w.join()

    # show that main thread can continue

    print("Job is done.")

Upvotes: 3
