Reputation: 49
I need to check at least 20k URLs to see whether each one is up and save some data in a database.
I already know how to check whether a URL is online and how to save data in the database, but without concurrency it will take ages to check all the URLs, so what is the fastest way to check thousands of URLs?
I am following this tutorial: https://realpython.com/python-concurrency/ and it seems that the "CPU-Bound multiprocessing Version" is the fastest approach, but I want to know whether it really is the fastest way or whether there are better options.
Edit:
Based on the replies, I am updating the post with a comparison of multiprocessing and multithreading.
Example 1: print "Hello!" 40 times
Threading
Multiprocessing with 8 cores
If you use 8 threads, threading is better.
Example 2, the problem posed in my question:
After several tests, threading is faster once you use more than 12 threads. For example, if you want to test 40 URLs and you use threading with 40 threads, it is about 50% faster than multiprocessing with 8 cores.
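For reference, a threaded check along these lines can be built with concurrent.futures; check_url and the URL list below are hypothetical placeholders, not my real code:

import requests
from concurrent.futures import ThreadPoolExecutor

def check_url(url):
    # Hypothetical placeholder: report whether the URL answers within 5 seconds.
    try:
        return url, requests.head(url, timeout=5).ok
    except requests.RequestException:
        return url, False

urls = ["https://google.com", "https://yahoo.com"]  # ~20k URLs in my real case

# 40 threads for 40 URLs; for 20k URLs a fixed cap (e.g. 50-100 workers) is more realistic.
with ThreadPoolExecutor(max_workers=40) as executor:
    results = list(executor.map(check_url, urls))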
Thanks for your help
Upvotes: 1
Views: 1899
Reputation: 533
I currently use multiprocessing with queues; it works fast enough for what I use it for.
Similarly to Artiom's solution above, I set the number of processes to 80 (currently), have the "workers" pull the data and send it to the queues, and once they are finished, go through the returned results and handle them depending on the queue.
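A minimal sketch of that pattern, assuming a hypothetical check_url helper; the worker function and queue names are illustrative, not the actual code described above:

import multiprocessing as mp

def check_url(url):
    # Hypothetical placeholder: probe the URL and return (url, status).
    return url, "up"

def worker(task_q, result_q):
    # Pull URLs from the task queue until the sentinel appears, push results.
    while True:
        url = task_q.get()
        if url is None:
            break
        result_q.put(check_url(url))

if __name__ == "__main__":
    task_q, result_q = mp.Queue(), mp.Queue()
    urls = ["https://google.com", "https://yahoo.com"]
    n_procs = 8  # the setup described above uses 80
    procs = [mp.Process(target=worker, args=(task_q, result_q)) for _ in range(n_procs)]
    for p in procs:
        p.start()
    for url in urls:
        task_q.put(url)
    for _ in range(n_procs):
        task_q.put(None)  # one sentinel per worker
    # drain the results before joining so workers never block on a full queue
    results = [result_q.get() for _ in urls]
    for p in procs:
        p.join()
    print(results)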
Upvotes: 0
Reputation: 650
I think you should use a pool: pool docs
Based on some results here: mp vs threading SO
I would say always use multiprocessing. Perhaps if you expect your requests to take a long time to resolve, the context-switching benefits of threads would overcome the brute force of multiprocessing.
Something like
import multiprocessing as mp

urls = ['google.com', 'yahoo.com']
with mp.Pool(mp.cpu_count()) as pool:
    results = pool.map(fetch_data, urls)
Edit: to address comments about a fixed number of subprocesses, I've shown how to request as many processes as you have logical cores.
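fetch_data is not defined above; a minimal sketch of what it could look like, assuming you only need to know whether the URL is up (the use of requests here is an assumption, not part of the original answer):

import requests

def fetch_data(url):
    # Probe the URL and report whether it answered; extend this to return
    # whatever data you want to store in the database.
    try:
        return url, requests.get(f"http://{url}", timeout=5).ok
    except requests.RequestException:
        return url, False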
Upvotes: 3
Reputation: 3836
To say that multiprocessing is always the best choice is incorrect: multiprocessing is best only for heavy computations!
For actions that do not require heavy computation, but only input/output operations such as database requests or calls to a remote web API, the best choice is the threading module. Threading can be faster than multiprocessing because multiprocessing needs to serialize data to send it to a child process, while threads share the same memory.
The typical approach in this case is to create an input queue.Queue, put the tasks into it (URLs in your case), and create several workers that take tasks from the queue:
import threading as thr
from queue import Queue


def work(input_q):
    """Take tasks from input_q and process them (print, or return with some code changes if you want)."""
    while True:
        item = input_q.get()
        if item == "STOP":
            break
        # else do some work here
        print("some result")


if __name__ == "__main__":
    input_q = Queue()
    urls = [...]
    threads_number = 8
    workers = [thr.Thread(target=work, args=(input_q,)) for i in range(threads_number)]
    # start workers here
    for w in workers:
        w.start()
    # start delivering tasks to workers
    for task in urls:
        input_q.put(task)
    # "poison pill" for all workers to stop them:
    for i in range(threads_number):
        input_q.put("STOP")
    # join all workers to the main thread here:
    for w in workers:
        w.join()
    # show that the main thread can continue
    print("Job is done.")
Upvotes: 3