Zbone

Reputation: 507

C# Parallel ForEach hold value

I am working on an app that searches for email addresses in the URLs from Google search results. It needs to return each email it finds, together with the URL of the page it was found on, to a DataGridView with two columns: Email and URL. I am using Parallel.ForEach for this, but it pairs emails with seemingly random URLs rather than the ones they were actually found on.

public static string htmlcon;  //htmlsource

public static List<string> emailList = new List<string>();

public static string Get(string url, bool proxy)
    {
        htmlcon = "";

        try
        {
            HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
            if (proxy)
                req.Proxy = new WebProxy(proxyIP + ":" + proxyPort);
            req.Method = "GET";
            req.UserAgent = Settings1.Default.UserAgent;
            if (Settings1.Default.EnableCookies == true)
            {
                CookieContainer cont = new CookieContainer();
                req.CookieContainer = cont;
            }
            WebResponse resp = req.GetResponse();
            StreamReader SR = new StreamReader(resp.GetResponseStream());
            htmlcon = SR.ReadToEnd();

            Thread.Sleep(400);
            resp.Close();
            SR.Close();
        }
        catch (Exception)
        {
            Thread.Sleep(500);
        }

        return htmlcon;

    }



  private void copyMails(string url)
    {    
        string emailPat = @"(\b[a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b)";
        MatchCollection mailcol = Regex.Matches(htmlcon, emailPat, RegexOptions.Singleline);
        foreach (Match mailMatch in mailcol)
        {
            email = mailMatch.Groups[1].Value;
            if (!emailList.Contains(email))
            {
                emailList.Add(email);

                Action dgeins = () => mailDataGrid.Rows.Insert(0, email, url);
                mailDataGrid.BeginInvoke(dgeins);
            }
        }
    }

  private void SEbgWorker_DoWork(object sender, DoWorkEventArgs e)
 {
     //A LOT OF IRRELEVANT STUFF BEING RUN

     Parallel.ForEach(allSElist.OfType<string>(), (s) =>
        {
            //Get URL
            Get(s, Settings1.Default.Proxyset);


            //match mails 1st page 
            copyMails(s);
            });

 }

So this is it: I execute a Get request (where "s" is the URL from the list) and then execute copyMails(s) on the URL's HTML source, which uses a regex to copy the emails. If I do it without Parallel it returns the correct URL for each email in the DataGridView. How can I do this in parallel and still get the correct match in the DataGridView?

Thanks

Upvotes: 1

Views: 1363

Answers (2)

Richard

Reputation: 109005

You would be better off using PLINQ's Where to filter (pseudo code):

var results = from i in input.AsParallel()
              let u = get the URL from i
              let d = get the data from u
              let v = try get the value from d
              where v is found
              select new {
                Url = u,
                Value = v
              };

Under the covers, AsParallel means that PLINQ's implementations of the LINQ operators (Select, Where, ...) are used.
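To make the pseudo code concrete, here is a self-contained sketch of the same shape. Get and ExtractEmails are hypothetical stand-ins (Get returns a canned string instead of doing a real HTTP GET) so the pairing behaviour can be seen without network access; the point is that url, html and emails are all locals of one query iteration, so each URL stays paired with the emails found on that page:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class PlinqSketch
{
    // Hypothetical stand-in for the real HTTP Get: returns canned HTML
    // containing an address derived from the URL's host.
    public static string Get(string url) =>
        "<a href=\"mailto:info@" + new Uri(url).Host + "\">contact</a>";

    public static string[] ExtractEmails(string html) =>
        Regex.Matches(html, @"[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,}")
             .Cast<Match>()
             .Select(m => m.Value)
             .Distinct()
             .ToArray();

    // url, html and emails are locals of each parallel iteration,
    // so no iteration can see another's page.
    public static (string Url, string[] Emails)[] FindEmails(string[] urls) =>
        (from url in urls.AsParallel()
         let html = Get(url)
         let emails = ExtractEmails(html)
         where emails.Length > 0
         select (url, emails)).ToArray();

    static void Main()
    {
        foreach (var r in FindEmails(new[] { "http://example.com", "http://example.org" }))
            Console.WriteLine(r.Url + " -> " + string.Join(", ", r.Emails));
    }
}
```

Note that AsParallel does not preserve input order by default; add AsOrdered if the output order matters.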


UPDATE: Now with more information

First there are a number of issues in your code:

  1. The variable htmlcon is static but used directly by multiple threads. This could well be your underlying problem. Consider just two input values: the first Get completes, setting htmlcon; before that thread's call to copyMails starts, the second thread's Get completes its HTML GET and overwrites htmlcon. The first thread then extracts emails from the second page's HTML and pairs them with the first URL.

  2. The list emailList is also accessed without locking by multiple threads. Most collection types in .NET (and any other programming platform) are not thread safe; you need to limit access to a single thread at a time.

  3. You are mixing up various activities in each of your methods. Consider applying the single responsibility principle.

  4. Thread.Sleep to handle an exception?! If you can't handle an exception (i.e. resolve the condition) then do nothing. In this case if the action throws then the Parallel.ForEach will throw: that'll do until you define how to handle the HTML GET failing.
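On issue 2, here is a small sketch (method name and inputs are made up for illustration) of lock-free de-duplication across parallel iterations using ConcurrentDictionary. TryAdd is atomic, so the membership test and the insert cannot interleave between threads the way Contains followed by Add on a plain List&lt;string&gt; can:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

static class DedupSketch
{
    // Counts unique emails seen across all parallel iterations.
    public static int CountUnique(string[] emails)
    {
        var seen = new ConcurrentDictionary<string, bool>();
        int added = 0;
        Parallel.ForEach(emails, email =>
        {
            // TryAdd returns true for exactly one thread per key,
            // so duplicates are rejected without any explicit lock.
            if (seen.TryAdd(email, true))
                Interlocked.Increment(ref added);
        });
        return added;
    }

    static void Main()
    {
        var emails = new[] { "a@example.com", "b@example.com", "a@example.com" };
        Console.WriteLine(CountUnique(emails)); // 2 unique addresses
    }
}
```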

Three suggestions:

  1. In my experience clean code (to an obsessive degree) makes things easier: the details of the format don't matter (one true brace style is better, but consistency is the key). Just going through and cleaning up the formatting showed up issues #1 and #2.

  2. Good naming. Don't abbreviate anything used over more than a few lines of code unless that is a significant term for the domain. Eg. s for the action parameter in the parallel loop is really a url so call it that. This kind of thing immediately makes the code easier to follow.

  3. Think about that regex for emails: there are many valid addresses it will not match (e.g. the use of + to provide multiple logical addresses: user+tag@example.com will be delivered to user@example.com and can then be used for local rules). An apostrophe (') is also a valid character (and I have known people frustrated by web sites that refused their addresses by getting this wrong).
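To see point 3 concretely, this sketch exercises the question's exact pattern against such addresses (FullyMatches is a hypothetical helper that checks the whole candidate was matched, since the pattern will happily match a valid-looking tail of an address it cannot match in full):

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexGaps
{
    // The pattern from the question, unchanged.
    static readonly Regex emailMatcher = new Regex(
        @"(\b[a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b)",
        RegexOptions.Singleline);

    // True only if the pattern matches the entire candidate address.
    public static bool FullyMatches(string candidate)
    {
        var m = emailMatcher.Match(candidate);
        return m.Success && m.Value == candidate;
    }

    static void Main()
    {
        Console.WriteLine(FullyMatches("user@example.com"));       // matched in full
        Console.WriteLine(FullyMatches("user+lists@example.com")); // '+' not in the character class
        Console.WriteLine(FullyMatches("o'brien@example.com"));    // apostrophe not in the character class
    }
}
```

In the last two cases the regex still reports a match, but only on the truncated tail ("lists@example.com", "brien@example.com"), which is arguably worse than no match at all.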

Second: A relatively direct clean up:

public static string Get(string url, bool proxy) {

    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
    if (proxy) {
        req.Proxy = new WebProxy(proxyIP + ":" + proxyPort);
    }
    req.Method = "GET";
    req.UserAgent = Settings1.Default.UserAgent;
    if (Settings1.Default.EnableCookies == true) {
        CookieContainer cont = new CookieContainer();
        req.CookieContainer = cont;
    }
    using (WebResponse resp = req.GetResponse())
    using (StreamReader SR = new StreamReader(resp.GetResponseStream())) {
        return SR.ReadToEnd();
    }

}

private static Regex emailMatcher = new Regex(@"(\b[a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b)", RegexOptions.Singleline);

private static string[] ExtractEmails(string htmlContent) {
    return emailMatcher.Matches(htmlContent)
                       .OfType<Match>()
                       .Select(m => m.Groups[1].Value)
                       .Distinct()
                       .ToArray();
}

private void SEbgWorker_DoWork(object sender, DoWorkEventArgs e) {

    Parallel.ForEach(allSElist.OfType<string>(), url => {
        var htmlContent = Get(url, Settings1.Default.Proxyset);
        var emails = ExtractEmails(htmlContent);

        foreach (var email in emails) {
            Action dgeins = () => mailDataGrid.Rows.Insert(0, email, url);
            mailDataGrid.BeginInvoke(dgeins);
        }
    });
}

Here I have:

  • Made use of using statements to automate the cleanup of resources.
  • Eliminated all mutable shared state.
  • Regex is explicitly documented to have thread safe instance methods. So I only need a single instance.
  • Removed noise: no need to pass the URL to ExtractEmails because the extraction doesn't use the URL.
  • Get now only performs the HTML GET, ExtractEmails just the extraction.

Third: The above will block threads on the slowest operation: the HTML GET.

The real concurrency benefit would be to replace HttpWebRequest.GetResponse and reading the response stream with their asynchronous equivalents.

Using Task would be the answer in .NET 4, but you need to directly work with Stream and encoding yourself because StreamReader doesn't provide any BeginABC/EndABC method pairs. But .NET 4.5 is almost here, so apply some async/await:

  • Nothing to do in ExtractEmails.
  • Get is now asynchronous, blocking in neither the HTTP GET nor reading the result.
  • SEbgWorker_DoWork uses Tasks directly to avoid mixing too many different ways to work with TPL. Since Get returns a Task<string> you can simply continue from it (note that by default ContinueWith runs whether or not the antecedent task succeeded, so accessing Result in the continuation will rethrow any failure; pass TaskContinuationOptions.OnlyOnRanToCompletion if you only want to continue on success):

This should work in .NET 4.5, but without a set of valid URLs for which this will work I cannot test.

public static async Task<string> Get(string url, bool proxy) {

    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
    if (proxy) {
        req.Proxy = new WebProxy(proxyIP + ":" + proxyPort);
    }
    req.Method = "GET";
    req.UserAgent = Settings1.Default.UserAgent;
    if (Settings1.Default.EnableCookies == true) {
        CookieContainer cont = new CookieContainer();
        req.CookieContainer = cont;
    }
    using (WebResponse resp = await req.GetResponseAsync())
    using (StreamReader SR = new StreamReader(resp.GetResponseStream())) {
        return await SR.ReadToEndAsync();
    }

}

private static Regex emailMatcher = new Regex(@"(\b[a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b)", RegexOptions.Singleline);

private static string[] ExtractEmails(string htmlContent) {
    return emailMatcher.Matches(htmlContent)
                       .OfType<Match>()
                       .Select(m => m.Groups[1].Value)
                       .Distinct()
                       .ToArray();
}

private void SEbgWorker_DoWork(object sender, DoWorkEventArgs e) {

    var tasks = allSElist.OfType<string>()
                         .Select(url =>
                             Get(url, Settings1.Default.Proxyset)
                                 .ContinueWith(htmlContentTask => {
                                     // Result rethrows if the GET faulted; pass
                                     // TaskContinuationOptions.OnlyOnRanToCompletion
                                     // to skip the continuation on failure.
                                     var htmlContent = htmlContentTask.Result;
                                     var emails = ExtractEmails(htmlContent);
                                     foreach (var email in emails) {
                                         // No InvokeAsync on WinForms, so do this the old way.
                                         Action dgeins = () => mailDataGrid.Rows.Insert(0, email, url);
                                         mailDataGrid.BeginInvoke(dgeins);
                                     }
                                 }))
                         .ToArray();

    Task.WaitAll(tasks);
}

Upvotes: 7

Regfor

Reputation: 8091

public static string htmlcon; //htmlsource

public static List<string> emailList = new List<string>();

The problem is that the members htmlcon and emailList are shared resources among threads and among iterations. Each iteration of Parallel.ForEach executes in parallel, which is why you see the strange behaviour.

How to solve the problem:

  • Modify your code to avoid static variables and shared state.

One option is to change from Parallel.ForEach to TPL task chaining: the result of one parallel operation becomes the input data for the next, which is one of many ways to restructure the code so that no state is shared.

  • Use locking or concurrent collections. The htmlcon variable could be made volatile, but for the list you must use locks or a concurrent collection.

The better way is to modify your code to avoid shared state; how to do that depends on your implementation, and task chaining is only one option.
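As a sketch of the "avoid shared state" advice (Get here is a hypothetical stand-in returning a canned string rather than real HTML): each iteration works only on its own locals, and the already-paired (url, html) result is added to the output list under a lock, since List&lt;T&gt; itself is not thread safe:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

static class NoSharedState
{
    // Hypothetical stand-in for fetching a page's HTML.
    static string Get(string url) => "page for " + url;

    public static List<(string Url, string Html)> Fetch(IEnumerable<string> urls)
    {
        var results = new List<(string Url, string Html)>();
        Parallel.ForEach(urls, url =>
        {
            // html is a local, so each iteration has its own copy:
            // no static htmlcon, no cross-thread overwrites.
            string html = Get(url);
            lock (results) // List<T> is not thread safe
            {
                results.Add((url, html));
            }
        });
        return results;
    }

    static void Main()
    {
        foreach (var r in Fetch(new[] { "a", "b", "c" }))
            Console.WriteLine(r.Url + ": " + r.Html);
    }
}
```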

Upvotes: 0
