Reputation: 1297
I'm working on a C# WinForms app and have around 84 URLs that I want to parse using Html Agility Pack.
With the code below, it takes 150 seconds to process all 84 records.
What options do I have to make it run faster? Any help is much appreciated!
This is the code structure I use:
public class URL_DATA
{
    public string URL { get; set; }
    public HtmlDocument doc { get; set; }
}
Then I call the function below to do the job:
public async Task ProcessUrls(string cookie)
{
    var tsk = new List<Task>();
    //UrlsToProcess is List<URL_DATA>
    UrlsToProcess.ForEach(async data =>
    {
        tsk.Add(Task.Run(async () =>
        {
            var htmToParse = await ScrapUtils.GetAgilityDocby(cookie, data.URL);
            var htmlDoc = new HtmlDocument();
            htmlDoc.LoadHtml(htmToParse);
            data.doc = htmlDoc;
        }));
    });
    await Task.WhenAll(tsk).ConfigureAwait(false);
}
Finally, this is the method I use to download the page as a string:
public static async Task<string> GetAgilityDocby(string cookie, string url)
{
    using (var wc = new WebClient())
    {
        wc.Proxy = null;// WebRequest.DefaultWebProxy;// GlobalProxySelection.GetEmptyWebProxy();
        wc.Headers.Add(HttpRequestHeader.Cookie, cookie);
        wc.Headers.Add(HttpRequestHeader.UserAgent,
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36");
        wc.Encoding = Encoding.UTF8;
        test++;
        return await wc.DownloadStringTaskAsync(url).ConfigureAwait(false);
    }
}
Upvotes: 0
Views: 432
Reputation: 896
Try raising the thread pool's minimum thread count:
ThreadPool.SetMinThreads(84, 84);
This should speed things up a lot.
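A slightly more careful way to make that call could look like the sketch below: it reads the current minimums first and only raises the worker-thread minimum, leaving the I/O completion-port minimum as it is. `ThreadPoolTuning` and `RaiseMinWorkerThreads` are hypothetical names made up for illustration; only `ThreadPool.GetMinThreads` and `ThreadPool.SetMinThreads` are real framework calls.

```csharp
using System;
using System.Threading;

public static class ThreadPoolTuning
{
    // Hypothetical helper: raises the minimum worker-thread count to at least
    // `desired` and never lowers it. Returns true if the pool already had
    // enough minimum threads or accepted the new setting.
    public static bool RaiseMinWorkerThreads(int desired)
    {
        ThreadPool.GetMinThreads(out int worker, out int io);
        if (worker >= desired)
        {
            return true; // already high enough, nothing to change
        }
        // SetMinThreads returns false if the request is rejected,
        // e.g. when `desired` exceeds the pool's maximum.
        return ThreadPool.SetMinThreads(desired, io);
    }
}
```

You would call `ThreadPoolTuning.RaiseMinWorkerThreads(84)` once at startup, before processing the URLs.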
As for the task creation pointed out by Ilya, I would recommend that you omit the Task.Run / Task.WhenAll part completely and use the Parallel mechanism, which was developed for exactly this kind of problem:
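Parallel.ForEach(UrlsToProcess, data =>
{
    // GetAgilityDocby returns Task<string>, and Parallel.ForEach does not await,
    // so block on the task here to get the downloaded HTML.
    var htmToParse = ScrapUtils.GetAgilityDocby(cookie, data.URL).GetAwaiter().GetResult();
    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(htmToParse);
    data.doc = htmlDoc;
});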
Upvotes: 0
Reputation: 30205
You are using ForEach with an asynchronous lambda. I suspect that this makes your code run sequentially instead of in parallel, since each iteration will await before the next one starts.
To find out for sure, you can try changing your task-creation code to something like this:
var allTasks = myUrls.Select(url => Task.Run(() => { /* your code */ }));
await Task.WhenAll(allTasks);
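A fuller sketch of the same idea, assuming the download method is already asynchronous (as your GetAgilityDocby is), so the Task.Run wrapper can be dropped: start one task per URL and await them all together. `Fetcher`, `FetchAllAsync`, the `fetch` delegate, and the `maxConcurrency` limit are hypothetical names for illustration, not part of your code or of Html Agility Pack.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class Fetcher
{
    // Hypothetical helper: calls fetch(url) for every URL concurrently, with at
    // most maxConcurrency calls in flight at once. The returned array is in the
    // same order as the input URLs.
    public static async Task<string[]> FetchAllAsync(
        IEnumerable<string> urls,
        Func<string, Task<string>> fetch,
        int maxConcurrency)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            // Select is lazy, so ToList() is what actually starts the tasks.
            var tasks = urls.Select(async url =>
            {
                await gate.WaitAsync().ConfigureAwait(false);
                try
                {
                    return await fetch(url).ConfigureAwait(false);
                }
                finally
                {
                    gate.Release();
                }
            }).ToList();

            // Task.WhenAll preserves the order of the input tasks.
            return await Task.WhenAll(tasks).ConfigureAwait(false);
        }
    }
}
```

With your code this could be called as `var pages = await Fetcher.FetchAllAsync(UrlsToProcess.Select(d => d.URL), url => ScrapUtils.GetAgilityDocby(cookie, url), 10);`, where 10 is an arbitrary limit to tune. Each string in `pages` can then be passed to `HtmlDocument.LoadHtml` as before.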
Upvotes: 1