Reputation: 3636
I'm currently building a small server-related application with .NET 4.0 and WinForms. I would like to take advantage of the Task Parallel Library, but I'm a little unsure about the best or 'correct' implementation here.
The purpose:
I'm thinking of a cascading approach, like this:
ProducerConsumerTask1 (Getting Files from the Network Path/Make the Files available to Read)
ProducerConsumerTask2 (Read the Files from Task1/Rewrite the Files from Task1)
ProducerConsumerTask3 (Getting the Rewritten Files/Transferring the Files from Task2 to DB)
And a bit of code:
private const int limit = 100;
private static BlockingCollection<ManagedFile> searchQueue = new BlockingCollection<ManagedFile>(limit);

public void StartFileTask()
{
    Task[] producers = new Task[1];
    producers[0] = Task.Factory.StartNew(() => ProduceFileSearchTask());
    Task.Factory.StartNew(() => ConsumeFileSearchTask());
}
public static void ProduceFileSearchTask()
{
    var pattern = new Regex(Properties.Settings.Default.DefaultRegexPattern);
    string path = Properties.Settings.Default.DefaultImportPath;

    // Materialize the search once. Enumerating lazily inside the loop
    // (files.ToList().Count() and files.ElementAt(i)) would rescan the
    // whole directory tree on every iteration.
    List<FileInfo> files = new DirectoryInfo(path)
        .EnumerateFiles("*.*", SearchOption.AllDirectories)
        .Where(x => pattern.IsMatch(x.Name))
        .ToList();

    foreach (FileInfo file in files)
    {
        var managedFile = new ManagedFile
        {
            Id = Guid.NewGuid(),
            ManagedFileName = file.FullName,
            ManagedFileAddedOn = DateTime.Now
        };

        // Add blocks when the queue is full, so the bounded collection
        // already throttles the producer; no Thread.SpinWait needed.
        if (!searchQueue.IsAddingCompleted)
            searchQueue.Add(managedFile);
    }

    // Signal the consumer that no more items are coming, so its
    // GetConsumingEnumerable() loop can finish.
    searchQueue.CompleteAdding();
}
public static void ConsumeFileSearchTask()
{
    foreach (var item in searchQueue.GetConsumingEnumerable())
    {
        // use ProducerTask for reading the files here
    }
}
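To show the cascading idea end to end, here is a rough sketch of chaining the three stages with one bounded BlockingCollection per hand-off. The names rewriteQueue, Rewrite, and SaveToDatabase are placeholders I made up for this illustration, not part of the question's code:

var searchQueue  = new BlockingCollection<ManagedFile>(100); // stage 1 -> stage 2
var rewriteQueue = new BlockingCollection<ManagedFile>(100); // stage 2 -> stage 3

var stage2 = Task.Factory.StartNew(() =>
{
    foreach (var file in searchQueue.GetConsumingEnumerable())
        rewriteQueue.Add(Rewrite(file));  // Rewrite = your stage-2 logic
    rewriteQueue.CompleteAdding();        // propagate shutdown downstream
});

var stage3 = Task.Factory.StartNew(() =>
{
    foreach (var file in rewriteQueue.GetConsumingEnumerable())
        SaveToDatabase(file);             // SaveToDatabase = your stage-3 logic
});

Each stage blocks naturally when its input is empty or its output is full, so the bounded capacities keep memory usage in check.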
It would be nice if someone shared their thoughts on this idea. Is this a good approach, and what could be improved? One more topic in this context: what about UI automation/reporting/status updates to the UI? How can that be done? Events/delegates?
Thanks!
Upvotes: 1
Views: 392
Reputation: 119
Adding my comments as an answer :)
This looks like the perfect scenario for TPL Dataflow. Check this out, it may help you a lot: TPL Dataflow Whitepaper
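As a rough sketch of what that could look like (assuming the TPL Dataflow API shape; RewriteFile and SaveToDb are placeholders for your own stage logic):

// TransformBlock/ActionBlock come from the System.Threading.Tasks.Dataflow
// package (available as a separate download for .NET 4.0).
var rewrite = new TransformBlock<ManagedFile, ManagedFile>(
    file => RewriteFile(file),
    new ExecutionDataflowBlockOptions { BoundedCapacity = 100, MaxDegreeOfParallelism = 4 });

var saveToDb = new ActionBlock<ManagedFile>(
    file => SaveToDb(file),
    new ExecutionDataflowBlockOptions { BoundedCapacity = 100 });

rewrite.LinkTo(saveToDb, new DataflowLinkOptions { PropagateCompletion = true });

// The file-search stage just posts into the pipeline:
// rewrite.Post(managedFile); ... rewrite.Complete(); when done.

BoundedCapacity gives you the same back-pressure as a bounded BlockingCollection, and completion propagates through the links instead of being wired up by hand.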
Another suggested approach: one task reads the new files and puts them into a BlockingCollection (the producer-consumer pattern). The consumer task maintains a list of concurrent tasks and reads from the collection to schedule new ones. By tweaking the consumer task and how many files it can track simultaneously, you can tune your performance. Once the consumer gets a notification that some task has finished, it reads from the producer again and schedules another one. They will run independently in parallel.
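One simple way to cap how many files are in flight at once is a SemaphoreSlim gate in the consumer (a sketch; the limit of 4 and ProcessFile are assumptions, not from the question):

var throttle = new SemaphoreSlim(4); // at most 4 files processed concurrently

foreach (var file in searchQueue.GetConsumingEnumerable())
{
    throttle.Wait(); // block until a processing slot frees up
    Task.Factory.StartNew(() =>
    {
        try { ProcessFile(file); }            // ProcessFile = your own logic
        finally { throttle.Release(); }       // free the slot so the loop continues
    });
}

Releasing the semaphore in a finally block ensures a failed file does not permanently consume a slot.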
Another framework to look at is Reactive Extensions (Rx): convert your source into an observable sequence of files and apply throttling there.
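A minimal Rx sketch of that idea, assuming the System.Reactive.Linq API (path and ProcessFile are placeholders):

// Turn the lazy directory scan into an observable stream and
// handle each file on a background scheduler.
var files = new DirectoryInfo(path)
    .EnumerateFiles("*.*", SearchOption.AllDirectories)
    .ToObservable(Scheduler.Default);

files.Subscribe(
    f => ProcessFile(f),                      // per-file work
    ex => Console.WriteLine(ex.Message));     // surface scan errors

Rx also gives you operators like Buffer and Throttle for free if you want to batch or rate-limit the stream.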
Upvotes: 1