Reputation: 436
I'm developing a .NET 4-based application that has to make requests to third-party servers in order to get information from them. I'm using HttpClient to make these HTTP requests.
I have to make hundreds or thousands of requests in a short period of time. I would like to throttle the creation of these requests to a limit (defined by a constant or something similar) so the other servers don't receive too many requests at once.
I've checked out this link, which shows how to reduce the number of tasks created at any one time.
Here is my non-working approach:
// create the factory
var factory = new TaskFactory(new LimitedConcurrencyLevelTaskScheduler(level));
// use the factory to create a new task that will create the request to the third-party server
var task = factory.StartNew(() => {
    return new HttpClient().GetAsync(url);
}).Unwrap();
Of course, the problem here is that even though only one task at a time is created, many requests still end up being created and processed at the same time, because GetAsync merely starts the request and the rest of the work happens outside the limited scheduler. I could not find a way to make HttpClient use a specific scheduler.
How should I handle this situation? I would like to limit the number of requests created to a certain limit, but without blocking while waiting for these requests to finish.
Is this possible? Any ideas?
Upvotes: 6
Views: 1805
Reputation: 142164
You might consider creating a new DelegatingHandler to sit in the request/response pipeline of the HttpClient that could keep count of the number of pending requests.
Generally a single HttpClient instance is used to process multiple requests. Unlike HttpWebRequest, disposing an HttpClient instance closes the underlying TCP/IP connection, so if you want to reuse connections you really need to re-use HttpClient instances.
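A minimal sketch of what such a handler could look like (this is my illustration, not the answerer's code): instead of merely counting pending requests, it uses a SemaphoreSlim to cap how many are in flight at once. It assumes .NET 4.5 (or the Microsoft.Net.Http package on .NET 4) for async/await.

```csharp
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical handler: caps the number of concurrent requests passing
// through the HttpClient pipeline.
public class ThrottlingHandler : DelegatingHandler
{
    private readonly SemaphoreSlim _semaphore;

    public ThrottlingHandler(int maxConcurrentRequests)
        : base(new HttpClientHandler())
    {
        _semaphore = new SemaphoreSlim(maxConcurrentRequests);
    }

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Wait asynchronously for a free slot, so no thread is blocked.
        await _semaphore.WaitAsync(cancellationToken);
        try
        {
            return await base.SendAsync(request, cancellationToken);
        }
        finally
        {
            _semaphore.Release();
        }
    }
}
```

Usage would be something like `var client = new HttpClient(new ThrottlingHandler(10));` — every request made through that single shared client then competes for the same 10 slots.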
Upvotes: 0
Reputation: 244908
If you can use .NET 4.5, one way would be to use TransformBlock from TPL Dataflow and set its MaxDegreeOfParallelism. Something like:
var block = new TransformBlock<string, byte[]>(
    url => new HttpClient().GetByteArrayAsync(url),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = level });

foreach (var url in urls)
    block.Post(url);
block.Complete();

var result = new List<byte[]>();
while (await block.OutputAvailableAsync())
    result.Add(block.Receive());
There is also another way of looking at this, through ServicePointManager. Using that class, you can set limits on MaxServicePoints (how many servers you can be connected to at once) and DefaultConnectionLimit (how many connections there can be to each server). This way, you could start all your Tasks at the same moment, but only a limited number of them would actually do something. Limiting the number of Tasks (e.g. by using TPL Dataflow, as I suggested above) will most likely be more efficient, though.
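To illustrate the ServicePointManager route, a sketch of the configuration (the limit values here are arbitrary examples, not recommendations):

```csharp
using System.Net;

class Startup
{
    static void Configure()
    {
        // Illustrative limits: connect to at most 10 distinct servers,
        // with at most 2 simultaneous connections to each one.
        ServicePointManager.MaxServicePoints = 10;
        ServicePointManager.DefaultConnectionLimit = 2;

        // Any HttpClient/HttpWebRequest traffic started after this point
        // queues up once a server's connection limit is reached.
    }
}
```

Note that these are process-wide settings, so they affect every HTTP request the application makes, not just the crawling workload.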
Upvotes: 1
Reputation: 5296
You might consider launching a fixed set of threads. Each thread performs its network operations serially, perhaps also pausing at certain points in order to throttle. This gives you direct control over the load: you can change your throttling policy and the number of threads independently.
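A sketch of this idea (my own illustration, assuming a BlockingCollection work queue feeding a fixed number of worker threads; WebClient stands in for whatever client the author had in mind):

```csharp
using System;
using System.Collections.Concurrent;
using System.Net;
using System.Threading;

class ThrottledDownloader
{
    static void Main()
    {
        const int workerCount = 4; // the throttle: at most 4 downloads in flight
        var urls = new BlockingCollection<string>();

        var workers = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            workers[i] = new Thread(() =>
            {
                // Each worker processes its URLs strictly serially.
                foreach (var url in urls.GetConsumingEnumerable())
                {
                    using (var client = new WebClient())
                    {
                        var body = client.DownloadString(url);
                        // ... process body ...
                    }
                    Thread.Sleep(100); // optional pause as an extra throttle
                }
            });
            workers[i].Start();
        }

        // Producer side: queue the work, then signal completion.
        urls.Add("http://example.com/");
        urls.CompleteAdding();
        foreach (var w in workers) w.Join();
    }
}
```

The worker count bounds concurrency, and the per-iteration sleep is where a more elaborate throttle policy (e.g. per-host delays) could go.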
Upvotes: 0
Reputation: 2357
First, you should consider partitioning the workload by website, or at least exposing an abstraction that lets you choose how to partition the list of URLs. For example, one strategy could be to partition by second-level domain (e.g. yahoo.com, google.com).
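A rough sketch of such a partitioning step (my illustration; it groups by `Uri.Host` as a stand-in for a true second-level-domain split, which would need a public-suffix list):

```csharp
using System;
using System.Linq;

class UrlPartitioner
{
    static void Main()
    {
        var urls = new[]
        {
            "http://www.google.com/a",
            "http://news.google.com/b",
            "http://www.yahoo.com/c"
        };

        // Approximate partitioning: one bucket per host. Each bucket
        // can then be crawled and throttled independently.
        var partitions = urls
            .GroupBy(u => new Uri(u).Host)
            .ToDictionary(g => g.Key, g => g.ToList());

        foreach (var p in partitions)
            Console.WriteLine("{0}: {1} url(s)", p.Key, p.Value.Count);
    }
}
```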
The other thing is that if you are doing serious crawling, you may want to consider doing it in the cloud instead. That way, each node can crawl a different partition. When you say "a short period of time", you are already setting yourself up for failure: you need hard numbers on what you want to achieve.
The other key benefit of partitioning well is that you can avoid hitting servers during their peak hours and risking IP bans at the router level, in case the site doesn't simply throttle you.
Upvotes: 0