Montoya

Reputation: 3049

Python3 urllib3 crawler - can't limit max connections to a single domain

I am using urllib3 in Python 3 to build a crawler that downloads multiple URLs.

In my main activity I create 20 threads that all use the same (single) instance of my Downloader class, which in turn holds one instance of PoolManager:

from urllib3 import PoolManager

def __init__(self):
    self.manager = PoolManager(num_pools=20)

I've tried submitting the same URL over and over again, and I see in the log that it creates a lot of connections to the same domain. I've tried limiting the number of pools (num_pools=1), but it still creates multiple connections to the same URL. From the documentation I understood that the PoolManager creates a new connection if the other connections to the same domain are in use.

I want to limit the number of connections to a single domain. Using up to 2 connections per domain is what a normal browser does, so it seems safe. How can I do that?

Upvotes: 0

Views: 805

Answers (1)

shazow

Reputation: 18197

PoolManager(num_pools=20) limits the manager to 20 cached ConnectionPool instances, each usually representing one domain. So you're effectively limiting the number of cached per-domain pools to 20; the per-domain connection limit lives one level deeper.

You can specify the per-ConnectionPool limit with maxsize=20. Since you're using the pool to throttle your crawler, you'll also want block=True, which prevents creating additional connections beyond that limit. With block=False (the default), more connections are created as needed, but those beyond maxsize will not be saved for re-use.

Altogether, you probably want:

def __init__(self):
    self.manager = PoolManager(maxsize=20, block=True)
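To illustrate, here is a self-contained sketch of 20 threads sharing one PoolManager capped at 2 connections per host. The tiny local HTTP server is an assumption added only so the example runs on its own; it stands in for whatever the crawler actually fetches.

```python
import threading
from http.server import HTTPServer, BaseHTTPRequestHandler
from urllib3 import PoolManager

# Minimal local server (an assumption for this sketch, not part of the
# original crawler) so the example needs no external network access.
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep the output quiet

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_address[1]

# One shared PoolManager for all threads:
#   maxsize=2  -> keep at most 2 connections per host (browser-like)
#   block=True -> a thread needing a 3rd connection waits for a free
#                 one instead of opening more, so the cap is hard
manager = PoolManager(maxsize=2, block=True)

def worker():
    resp = manager.request("GET", url)
    assert resp.status == 200

threads = [threading.Thread(target=worker) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
server.shutdown()
print("all 20 requests served through at most 2 connections per host")
```

Because every thread goes through the same manager, requests to the same host are routed to the same per-host pool, and block=True turns maxsize into an actual throttle rather than just a cache size.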

The urllib3 documentation has a bit more detail on which parameters are available.

Upvotes: 1
