Reputation: 3049
I am using Python 3 with urllib3 to build a crawler that downloads multiple URLs.
In my main program I create 20 threads that all share the same (single) instance of my Downloader class, which in turn uses one instance of PoolManager:
def __init__(self):
    self.manager = PoolManager(num_pools=20)
I've tried submitting the same URL over and over again, and I see in the log that it creates a lot of connections to the same domain. I tried limiting the number of pools (num_pools=1), but it still creates multiple connections to the same URL. From the documentation I understood that PoolManager creates a new connection when the existing connections to a domain are all in use.
I want to limit the number of connections to a single domain. Using up to 2 connections per domain is what a normal browser does, so it should be safe. How can I do that?
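Roughly, the setup looks like this (simplified; the download method and the example URL below are just placeholders for illustration):

from threading import Thread
from urllib3 import PoolManager

class Downloader:
    def __init__(self):
        self.manager = PoolManager(num_pools=20)

    def download(self, url):
        # every thread goes through the same shared PoolManager
        return self.manager.request("GET", url).data

downloader = Downloader()  # one shared instance
threads = [Thread(target=downloader.download, args=("http://example.com/",))
           for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()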
Upvotes: 0
Views: 805
Reputation: 18197
PoolManager(num_pools=20) will limit the manager to 20 cached ConnectionPool instances, each usually representing one domain. So you're effectively limiting the number of cached per-domain pools to 20; the per-domain connection limit is one level deeper.
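A quick way to see this distinction (the hosts below are arbitrary examples, and the pools attribute is an implementation detail of current urllib3, so treat this as a sketch):

from urllib3 import PoolManager

manager = PoolManager(num_pools=2)  # cache at most 2 per-host ConnectionPools

manager.request("GET", "http://example.com/")  # pool for example.com
manager.request("GET", "http://example.org/")  # pool for example.org
manager.request("GET", "http://example.net/")  # a third host evicts the least recently used pool

print(len(manager.pools))  # still 2: num_pools caps cached pools, not connections per host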
We can specify the limit per ConnectionPool with maxsize=20. Since you're using the pool to throttle your crawler, you'll also want to use block=True, which prevents additional connections from being created beyond that limit. With block=False (the default), more connections are created as needed, but connections beyond the maxsize are not saved for re-use.
Altogether, you probably want:
def __init__(self):
    self.manager = PoolManager(maxsize=20, block=True)
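As a rough illustration of the throttling effect (the thread count and example URL are assumptions, and maxsize=2 is used here to match the two-connections-per-domain goal from the question): with block=True, at most maxsize connections to a given host exist at once, and extra threads wait for one to be freed.

from concurrent.futures import ThreadPoolExecutor
from urllib3 import PoolManager

manager = PoolManager(maxsize=2, block=True)  # at most 2 live connections per host

def fetch(url):
    # With block=True, a 3rd concurrent request to the same host waits here
    # until one of the 2 pooled connections is returned.
    return manager.request("GET", url).status

with ThreadPoolExecutor(max_workers=20) as executor:
    statuses = list(executor.map(fetch, ["http://example.com/"] * 20))

print(statuses)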
The urllib3 documentation has a bit more detail on which parameters are available.
Upvotes: 1