Mojtaba Khooryani

Reputation: 1875

Getting HTML response fails repeatedly after first failure

I have a program that fetches the HTML source of ~500 web pages every 5 minutes.

It runs correctly until the first failure (unable to download the source within 6 seconds).

After that, all subsequent threads fail.

If I restart the program, it again runs correctly until the next failure.

Where am I going wrong, and what should I do to make this better?

This function runs every 5 minutes:

        foreach (Company company in companies)
        {
            string link = company.GetLink();

            Thread t = new Thread(() => F(company, link));
            t.Start();
            if (!t.Join(TimeSpan.FromSeconds(6)))
            {
                Debug.WriteLine( company.Name + " Fails");
                t.Abort();
            }
        }

and this function downloads the HTML code:

    private void F(Company company, string link)
    {
        try
        {
            string htmlCode = GetInformationFromWeb.GetHtmlRequest(link);
            company.HtmlCode = htmlCode;
        }
        catch (Exception ex)
        {
        }
    }

and this class:

public class GetInformationFromWeb
{
    public static string GetHtmlRequest(string url)
    {
        using (MyWebClient client = new MyWebClient())
        {
            client.Encoding = Encoding.UTF8;
            string htmlCode = client.DownloadString(url);
            return htmlCode;
        }
    }
}

and the web client class:

public class MyWebClient : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        HttpWebRequest request = base.GetWebRequest(address) as HttpWebRequest;
        request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
        return request;
    }
}

Upvotes: 0

Views: 77

Answers (2)

Lukasz032

Reputation: 377

I have four basic suggestions:

  1. Use HttpClient instead of the obsolete WebClient. HttpClient handles asynchronous operations natively and is far more flexible. You can read downloaded content into strings or streams on another thread, since you can configure await not to resume on the original context, and you can give the client a 6-second timeout so that exceeding it raises a TaskCanceledException.
  2. Avoid swallowing exceptions (as you do in your F function): it breaks debugging and hides the real cause of problems. A correctly written program will never raise an exception during normal operation.
  3. You are using threads in a useless way: they don't even overlap, because you block the calling loop with Join after each thread starts. In .NET it is better to multitask with Tasks, for example Task.Run(async () => await YourTaskAsync()) (or AsyncContext.Run(...) if you need UI access), which doesn't block anything.
  4. The whole GetInformationFromWeb class is pointless at the moment, and you are also spawning multiple client objects pointlessly, since one HTTP client object can handle multiple requests. With HttpClient you instantiate it once as a static variable with all the necessary configuration, and then call it from anywhere with as little code as client.GetStringAsync(uri).
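The suggestions above can be sketched roughly as follows. This is a minimal illustration, not code from the question: the Company stub, the Downloader class, and the FetchAllAsync name are assumptions made for the example; only the 6-second timeout and the decompression settings come from the original post.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// Minimal stand-in for the question's Company type (assumed shape).
public class Company
{
    public string Name { get; set; }
    public string HtmlCode { get; set; }
    private readonly string link;
    public Company(string name, string link) { Name = name; this.link = link; }
    public string GetLink() => link;
}

public static class Downloader
{
    // One client for the whole application; HttpClient is safe to use
    // concurrently from multiple tasks.
    private static readonly HttpClient client = new HttpClient(
        new HttpClientHandler
        {
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
        })
    {
        Timeout = TimeSpan.FromSeconds(6) // exceeding this raises TaskCanceledException
    };

    public static async Task FetchAsync(Company company)
    {
        try
        {
            company.HtmlCode = await client.GetStringAsync(company.GetLink());
        }
        catch (TaskCanceledException)
        {
            Console.WriteLine(company.Name + " timed out");
        }
        catch (HttpRequestException ex)
        {
            Console.WriteLine(company.Name + " failed: " + ex.Message);
        }
    }

    // Start all downloads at once and await them together; no thread
    // is blocked while the responses are in flight.
    public static Task FetchAllAsync(IEnumerable<Company> companies)
        => Task.WhenAll(companies.Select(FetchAsync));
}
```

Because the tasks overlap instead of running one Join at a time, a single slow site delays only its own task rather than the whole loop.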

OT: Is this some kind of academic project?

Upvotes: 1

Robert

Reputation: 3543

If your foreach loops over 500 companies and each iteration creates a new thread, your internet connection can become the bottleneck: requests will exceed the 6-second timeout and fail very often.

I suggest you try parallelism instead. Note MaxDegreeOfParallelism, which caps the number of parallel executions; you can tune it to suit your needs.

    Parallel.ForEach(companies, new ParallelOptions { MaxDegreeOfParallelism = 10 }, company =>
    {
        try
        {
            string htmlCode = GetInformationFromWeb.GetHtmlRequest(company.GetLink());
            company.HtmlCode = htmlCode;
        }
        catch (Exception ex)
        {
            // ignore or process the exception
        }
    });

Upvotes: 1
