Reputation: 117
I've been working on a multithreaded image scraper in Python using requests and multiprocessing.dummy.
The script runs fine until it reaches a certain point, then the whole workflow gets really slow. It also seems that the more threads I use, the earlier this happens.
The download part looks like:
import requests

def download(URL):
    try:
        URL = URL.rstrip()
        down = requests.get(URL, headers={'x-test2': 'true'})
        # ... save the downloaded image here ...
    except BaseException as e:
        print("Error:", e)
The threading part looks like:
from multiprocessing.dummy import Pool as ThreadPool

if __name__ == '__main__':
    ThreadPool(20).map(download, URLlist)
So my question is: what is slowing down my whole download process? The URLs are fine, and it should keep going just as it did before. Is there a call I'm missing, or is it something with my threading part (threads not being closed correctly...)?
It's also important to note that this problem doesn't appear with a smaller URL list.
(But it shouldn't be a request-limit problem with the site I'm downloading from, because both while the script is running and afterwards I experience no problems with the site's speed or availability.) Why is that?
Upvotes: 0
Views: 131
Reputation: 2560
If pool operations slow down over a period of time, closing the pool every so often might (or might not) help. Try something simple like this...
import time
from multiprocessing.dummy import Pool as ThreadPool

if __name__ == '__main__':
    max_size = 1000  # use some large value here
    for i in range(0, len(URLlist), max_size):
        st = time.time()
        pool = ThreadPool(20)
        pool.map(download, URLlist[i: i + max_size])
        pool.close()  # must be called before join()
        pool.join()
        et = time.time()
        print('Processing took %.3f seconds' % (et - st))
Try some different, but large, values for max_size. This is the number of elements from URLlist that your code will process before closing the pool and opening another one.
As I said in my comment, I'm aware of this issue for multiprocessing.Pool() but I'm not certain that ThreadPool() has the same issue. For mp.Pool(), this only happens with extremely large lists of items to process. When this happens you typically see memory usage continually increase as the program runs (so look for this). I believe the underlying issue is that pool workers get created over and over but not correctly garbage-collected until you close the pool.
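A quick way to check for that memory pattern is to log the process's resident memory between batches. This is just a rough sketch; it assumes the third-party psutil package is available (it isn't used anywhere in the code above).

import psutil  # third-party package: pip install psutil

_proc = psutil.Process()

def log_memory(label):
    # Print the current resident set size so steady growth is easy to spot
    rss_mb = _proc.memory_info().rss / (1024 * 1024)
    print('%s: RSS = %.1f MB' % (label, rss_mb))

Calling something like log_memory('after batch %d' % i) right after pool.join() in the loop above should make it obvious whether memory really does keep climbing as the batches go by.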
One other thing to consider... it's possible that some URLs take a long time to process, and after your code has run for a while many of your threads end up bogged down with the slower URLs, making things appear to slow down overall. If that's the case, closing the pool occasionally won't help.
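If you want to test that theory, one option is to time each request and give requests.get a timeout so a single slow URL can't tie up a worker thread indefinitely. The sketch below reworks the download function from the question; the 10-second timeout and the 5-second "slow" threshold are arbitrary values chosen for illustration.

import time
import requests

def download(URL):
    URL = URL.rstrip()
    start = time.time()
    try:
        # timeout stops one slow URL from blocking a worker thread forever;
        # 10 seconds is an arbitrary choice, tune it for your target site
        down = requests.get(URL, headers={'x-test2': 'true'}, timeout=10)
        down.raise_for_status()
        # ... save the downloaded image here ...
    except requests.RequestException as e:
        print('Error downloading %s: %s' % (URL, e))
    finally:
        elapsed = time.time() - start
        if elapsed > 5:  # arbitrary threshold for "slow"
            print('Slow URL (%.1f s): %s' % (elapsed, URL))

If the slow URLs cluster toward the end of the run, that would explain the slowdown better than a pool problem would.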
Upvotes: 1