downtown1933

Reputation: 1

Performance issue with C# threading task and multiple web page downloads

I'm running code to download a large number of documents from county websites, usually tax statements. The code I'm running seems fast and efficient in the beginning, and works great until the file count reaches about 200. This is when performance begins to plummet. If I let it keep running, it still works, but gets to a point where it's painfully slow. I usually have to stop it, figure out which files haven't been downloaded, and start it over.

Any help on making this faster, more efficient, and smooth (regardless of file count) would be greatly appreciated.

I suspect the performance issue comes from writing each result to an HTML file immediately. I've tried storing the results in a StringBuilder until the downloads finish, but of course I run out of memory.

I've also tried adjusting MaxDegreeOfParallelism. Lowering it to 5 seemed to help a little, but the slowdown as the file count grows is still there.

    private void Run_Mass_TaxBillDownload()
    {
        string strTag = null;
        string county = countyName.SelectedItem.ToString() + "-";

        //Converting urlList to uriList...
        List<Uri> uriList = new List<Uri>();
        foreach (string url in TextViewer.Lines)//"TextViewer is a textbox where urls to be downloaded are stored...
        {
            if (url.Length > 5){Uri myUri = new Uri(url.Trim(), UriKind.RelativeOrAbsolute);uriList.Add(myUri);}
        }

        Parallel.ForEach(uriList, new ParallelOptions { MaxDegreeOfParallelism = 5 }, str =>
        {
            using (WebClient client = new WebClient())
            {
                //Extracting taxbill numbers from the url to use as file names in the saved file...
                string FirstString = null;
                string LastString = null;
                if (str.ToString().ToLower().Contains("&tptick")) { FirstString = "&TPTICK="; LastString = "&TPSX="; }
                if (str.ToString().ToLower().Contains("&ticket=")) { FirstString = "&ticket="; LastString = "&ticketsuff="; }
                if (str.ToString().ToLower().Contains("demandbilling")) { FirstString = "&ticketNumber="; LastString = "&ticketSuffix="; }

                //Start downloading...
                client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
                client.DownloadStringCompleted += new DownloadStringCompletedEventHandler(clientTaxBill_DownloadStringCompleted);
                client.DownloadStringAsync(str, county + (Between(str.ToString(), FirstString, LastString)));
            }
        });
    }
    private static void clientTaxBill_DownloadStringCompleted(Object sender, DownloadStringCompletedEventArgs e)
    {
        //Creating Output file....
        string deskTopPath = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
        string outputPath = deskTopPath + "\\Downloaded Tax Bills";
        string errOutputFile = outputPath + "\\errorReport.txt";
        string results = null;
        string taxBillNum = e.UserState as string;

        try
        {
            File.WriteAllText(outputPath + "\\" + taxBillNum + ".html", e.Result.ToString());
        }
        catch
        {
            results = Environment.NewLine + "<<{ERROR}>> NOTHING FOUND FOR" + taxBillNum;
            File.AppendAllText(errOutputFile, results);
        }
    }

Upvotes: 0

Views: 71

Answers (1)

Steve Drake

Reputation: 2048

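DownloadStringAsync returns immediately instead of waiting for the download, so you will end up running more than 5 downloads at once: each iteration hooks up the DownloadStringCompleted callback, starts the download, and then the loop just carries on to the next URL.

So Parallel.ForEach is not waiting for each download to complete, and MaxDegreeOfParallelism only limits how many downloads are being started at the same time, not how many are in flight.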
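To make that concrete, here is a minimal sketch that needs no network. It stands in a `Task.Delay` for each download (an assumption for illustration, it is not WebClient), and `FireAndForgetDemo`, `MeasurePeakAsync` and `FakeDownloadAsync` are made-up names. It starts the fake downloads from a `Parallel.ForEach` limited to 5, without waiting for them, and records how many were in flight at once.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

public static class FireAndForgetDemo
{
    // Returns the highest number of simulated downloads in flight at once.
    public static async Task<int> MeasurePeakAsync(int itemCount = 50)
    {
        var gate = new object();
        int inFlight = 0, peak = 0;
        var pending = new ConcurrentBag<Task>();

        // Hypothetical stand-in for a download: starts, then completes later.
        async Task FakeDownloadAsync()
        {
            lock (gate) { inFlight++; peak = Math.Max(peak, inFlight); }
            await Task.Delay(500);
            lock (gate) { inFlight--; }
        }

        // Like DownloadStringAsync, the body only *starts* the work and returns,
        // so the limit of 5 applies to starting downloads, not to running them.
        Parallel.ForEach(Enumerable.Range(0, itemCount),
            new ParallelOptions { MaxDegreeOfParallelism = 5 },
            _ => pending.Add(FakeDownloadAsync()));

        await Task.WhenAll(pending);
        return peak;
    }
}
```

Calling `await FireAndForgetDemo.MeasurePeakAsync()` should normally report a peak well above 5, because the loop finishes starting all the items long before the first delay expires.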

ActionBlock (from TPL Dataflow) is your friend here, as it just works better with async code. Couple that with HttpClient instead of WebClient.

Try something like this:

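    using System;
    using System.Net.Http;
    using System.Threading.Tasks;
    using System.Threading.Tasks.Dataflow; // TPL Dataflow

    // Reuse one HttpClient for all requests instead of creating one per URL.
    private static readonly HttpClient httpClient = new HttpClient();

    public static async Task Downloader()
    {
        var urls = new string[] { "https://www.google.co.uk/", "https://www.microsoft.com/" };

        // The block awaits each download, so at most 5 run at the same time.
        var ab = new ActionBlock<string>(async (url) =>
        {
            using (var httpResponse = await httpClient.GetAsync(url))
            {
                var text = await httpResponse.Content.ReadAsStringAsync();

                // Placeholder: write the text to a file here instead of printing it
                Console.WriteLine(text);
            }
        }, new ExecutionDataflowBlockOptions() { MaxDegreeOfParallelism = 5 });

        foreach (var url in urls)
        {
            await ab.SendAsync(url);
        }

        // Signal that no more URLs are coming, then wait for the queued ones to finish.
        ab.Complete();
        await ab.Completion;
        Console.WriteLine("Done");
        Console.ReadKey();
    }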

MaxDegreeOfParallelism = 5 means at most 5 URLs are processed at the same time. The await ab.SendAsync(url); is important: if you want to restrict the buffer size with BoundedCapacity = n, SendAsync will wait until the block has room, whereas the ab.Post() method will not wait; it just returns false if there is no room.
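The difference between SendAsync and Post on a bounded block can be seen in a small sketch like the one below. It uses no network; the class name `BoundedCapacityDemo`, the integer items, and the `release` signal that keeps the first item "in progress" are all made up for illustration, and BoundedCapacity = 1 is chosen only to make the block fill up immediately.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class BoundedCapacityDemo
{
    public static async Task<(bool PostAccepted, bool SendAccepted, int Processed)> RunAsync()
    {
        int processed = 0;
        // Hypothetical signal that holds items in the block until we release them.
        var release = new TaskCompletionSource<bool>(
            TaskCreationOptions.RunContinuationsAsynchronously);

        var block = new ActionBlock<int>(async item =>
        {
            await release.Task;
            Interlocked.Increment(ref processed);
        }, new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = 1,
            BoundedCapacity = 1 // room for a single item, including the one being processed
        });

        await block.SendAsync(1);                    // fills the block

        bool postAccepted = block.Post(2);           // no room: Post does not wait
        Task<bool> pendingSend = block.SendAsync(3); // no room: SendAsync waits instead

        release.SetResult(true);                     // let item 1 finish, freeing space
        bool sendAccepted = await pendingSend;

        block.Complete();
        await block.Completion;
        return (postAccepted, sendAccepted, processed);
    }
}
```

In this sketch item 2 is simply dropped because the return value of Post is the only notice you get, which is why awaiting SendAsync is the safer choice when feeding a long list of URLs into a bounded block.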

Upvotes: 1
