Mauro Ciancio

Reputation: 436

.NET's HttpClient Throttling

I'm developing a .NET 4-based application that has to make requests to third-party servers in order to get information from them. I'm using HttpClient to make these HTTP requests.

I have to make a hundred or a thousand requests in a short period of time. I would like to throttle the creation of these requests to a certain limit (defined by a constant or something similar) so the other servers don't receive a flood of requests.

I've checked out this link, which shows how to reduce the number of tasks created at any one time.

Here is my non-working approach:

// create the factory with a scheduler that limits concurrency to `level`
var factory = new TaskFactory(new LimitedConcurrencyLevelTaskScheduler(level));

// use the factory to create a new task that will create the request to the
// third-party server; Unwrap turns the resulting Task<Task<...>> into a Task<...>
var task = factory.StartNew(() => {
    return new HttpClient().GetAsync(url);
}).Unwrap();

Of course, the problem here is that even though only one task is created at a time, a lot of requests still get created and processed at the same time: the scheduler only limits the call to GetAsync, which returns immediately, and the actual request runs outside that scheduler. I could not find a way to make HttpClient use a particular scheduler.

How should I handle this situation? I would like to limit the number of requests created to a certain limit, but without blocking while waiting for these requests to finish.

Is this possible? Any ideas?

Upvotes: 6

Views: 1805

Answers (4)

Darrel Miller

Reputation: 142164

You might consider creating a new DelegatingHandler to sit in the request/response pipeline of the HttpClient that could keep count of the number of pending requests.
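
A minimal sketch of such a handler, assuming .NET 4.5 (for SemaphoreSlim.WaitAsync and async/await); the ThrottlingHandler name and the `level` constant are placeholders, not part of the framework:

// uses System.Net.Http, System.Threading, System.Threading.Tasks
public class ThrottlingHandler : DelegatingHandler
{
    private readonly SemaphoreSlim _semaphore;

    public ThrottlingHandler(int maxPendingRequests)
        : base(new HttpClientHandler())
    {
        _semaphore = new SemaphoreSlim(maxPendingRequests);
    }

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // wait (asynchronously, without blocking a thread) until a slot is free
        await _semaphore.WaitAsync(cancellationToken);
        try
        {
            return await base.SendAsync(request, cancellationToken);
        }
        finally
        {
            _semaphore.Release();
        }
    }
}

// one shared client; at most `level` requests are in flight at any time
var client = new HttpClient(new ThrottlingHandler(level));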

Generally, a single HttpClient instance is used to process multiple requests. Unlike HttpWebRequest, disposing an HttpClient instance closes the underlying TCP/IP connection, so if you want to reuse connections you really need to re-use HttpClient instances.

Upvotes: 0

svick

Reputation: 244908

If you can use .NET 4.5, one way would be to use TransformBlock from TPL Dataflow and set its MaxDegreeOfParallelism. Something like:

// download each url, with at most `level` requests running at the same time
var block = new TransformBlock<string, byte[]>(
    url => new HttpClient().GetByteArrayAsync(url),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = level });

foreach (var url in urls)
    block.Post(url);

block.Complete();

// gather the downloaded bodies as they become available
var result = new List<byte[]>();

while (await block.OutputAvailableAsync())
    result.Add(block.Receive());

There is also another way of looking at this, through ServicePointManager. Using that class, you can set limits on MaxServicePoints (how many servers you can be connected to at once) and DefaultConnectionLimit (how many connections there can be to each server). This way, you could start all your Tasks at the same moment, but only a limited number of them would actually do something. Limiting the number of Tasks (e.g. by using TPL Dataflow, as I suggested above) will most likely be more efficient, though.
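
For illustration, a minimal sketch of that approach; the values are examples (`level` stands in for whatever limit you choose), and both properties must be set before the first request is issued:

// uses System.Net; configure before any request is made
ServicePointManager.MaxServicePoints = 20;           // connect to at most 20 distinct servers
ServicePointManager.DefaultConnectionLimit = level;  // at most `level` connections per server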

Upvotes: 1

seand

Reputation: 5296

You might consider launching a fixed set of threads. Each thread performs the client's network operations serially, perhaps also pausing at certain points in order to throttle. This gives you precise control over the load; you can change your throttling policy and the number of threads independently.
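
A rough sketch of that idea, assuming a shared queue of URLs; the `level` (thread count) and `pauseMs` (pause between requests) constants are hypothetical:

// uses System.Collections.Concurrent, System.Net.Http, System.Threading
var queue = new BlockingCollection<string>();
var client = new HttpClient();

for (int i = 0; i < level; i++)
{
    new Thread(() =>
    {
        // each worker drains the shared queue serially
        foreach (var url in queue.GetConsumingEnumerable())
        {
            var body = client.GetStringAsync(url).Result; // blocks this thread only
            Thread.Sleep(pauseMs);                        // crude throttle between requests
        }
    }).Start();
}

foreach (var url in urls)
    queue.Add(url);
queue.CompleteAdding();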

Upvotes: 0

John Zabroski

Reputation: 2357

First, you should consider partitioning the workload by website, or at least exposing an abstraction that lets you choose how to partition the list of URLs. For example, one strategy could be to partition by second-level domain (yahoo.com, google.com, and so on).
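
As a hedged sketch of such a partitioning, grouping by host is a simple stand-in for true second-level-domain grouping:

// uses System.Linq; each partition can then be crawled and throttled independently
var partitions = urls
    .GroupBy(url => new Uri(url).Host)
    .ToDictionary(g => g.Key, g => g.ToList());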

The other thing is that if you are doing serious crawling, you may want to consider doing it in the cloud instead, so that each node can crawl a different partition. When you say "a short period of time", you are already setting yourself up for failure; you need hard numbers on what you want to attain.

The other key benefit of partitioning well is that you can also avoid hitting servers during their peak hours and risking an IP ban at their router level, in case the site doesn't simply throttle you.

Upvotes: 0
