Alexander Korolchuk

Reputation: 365

How can I use the Task class for parallel processing?

I'm a junior programmer, and I'm trying to solve a task. Using C# and .NET 4.0, I iterate through folders, pick out all *.xml files, and write each one to a new folder with the new extension *.bin. Before writing each file, I apply an algorithm written by another programmer, whose implementation I don't know.

So I read a *.xml file, deserialize it, and write it out as a new *.bin file. Before I used parallel programming, 2000 files took 1 minute. Then I decided to apply parallel programming with Task: I now create a new Task for each file (all processing — read, deserialize, write — happens in one Task), and the time dropped to 40 seconds. But I expected parallel programming to reduce the time to 25-30 seconds.

Please comment on what I'm doing wrong and how I should implement this. Thanks.

byte[] buffer;
using (Stream stream = new FileInfo(file).OpenRead())
{
    buffer = new byte[stream.Length];
    // Stream.Read may return fewer bytes than requested, so loop until
    // the whole file has been read.
    int offset = 0;
    while (offset < buffer.Length)
    {
        int read = stream.Read(buffer, offset, buffer.Length - offset);
        if (read == 0)
            break;
        offset += read;
    }
}

foreach (var culture in supportedCultures)
{
    CultureInfo currentCulture = culture;
    Tasks.Add(Task.Factory.StartNew(() =>
    {
        var memoryStream = new MemoryStream(buffer);
        Task<object> serializeTask = Task.Factory.StartNew(() =>
        {
            return typesManager.Load(memoryStream, currentCulture);
        }, TaskCreationOptions.AttachedToParent);

        string currentOutputDirectory = null;
        if (outputDirectory != null)
        {
            currentOutputDirectory = outputDirectory.Replace(
                PlaceForCultureInFolderPath, currentCulture.ToString());
            Directory.CreateDirectory(currentOutputDirectory);
        }

        string binFile = Path.ChangeExtension(Path.GetFileName(file), ".bin");
        string binPath = Path.Combine(
            currentOutputDirectory ?? Path.GetDirectoryName(file),
            binFile);

        // File.Create truncates an existing file; File.OpenWrite would leave
        // trailing bytes behind if the new content is shorter.
        using (FileStream outputStream = File.Create(binPath))
        {
            try
            {
                new BinaryFormatter().Serialize(outputStream, serializeTask.Result);
            }
            catch (SerializationException e)
            {
                ReportCompilationError(e.Message, null);
            }
        }
    }));
}

Upvotes: 0

Views: 420

Answers (3)

varun257

Reputation: 296

Task.Factory

var task1 = Task.Factory.StartNew(() =>
{
    // some operation
});
var task2 = Task.Factory.StartNew(() =>
{
    // some operations
});
Task.WaitAll(task1, task2);

But this won't guarantee a new thread for every task, since it uses the available thread-pool threads and simply schedules the work onto whatever thread is free. Hence, I would suggest you use Parallel.ForEach:

var options = new ParallelOptions { MaxDegreeOfParallelism = 2 }; // or more
Parallel.ForEach(list, options, a => { });

http://msdn.microsoft.com/en-us/library/system.threading.tasks.parallel.foreach.aspx
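Applied to the question's scenario, a Parallel.ForEach version might look like the sketch below. `ProcessFile` and the input path are placeholders standing in for the asker's read-deserialize-write step, not names from the question:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

class Converter
{
    static void Main()
    {
        var options = new ParallelOptions
        {
            // An upper bound, not a guarantee; the scheduler may use fewer threads.
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        // Enumerate all *.xml files under a (hypothetical) input folder
        // and process them in parallel, one call per file.
        Parallel.ForEach(
            Directory.EnumerateFiles(@"C:\input", "*.xml", SearchOption.AllDirectories),
            options,
            ProcessFile);
    }

    static void ProcessFile(string file)
    {
        // Placeholder for the read -> deserialize -> write *.bin step.
    }
}
```

Unlike one Task per file, Parallel.ForEach partitions the file list across a bounded number of workers, which avoids flooding the thread pool with thousands of queued work items.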

Upvotes: 2

Pavel Voronin

Reputation: 14005

First. There is no guarantee that TPL gives any performance gain.
As Jon says, writing to an HDD can decrease performance unless the OS caches the files for later sequential writes, and the cache size definitely has its limits.

Second. The default scheduler is oriented towards utilizing the CPU cores, so it's possible that only a few tasks are processed in parallel while the others wait in a queue. You can change this default by explicitly setting ParallelOptions.MaxDegreeOfParallelism, or by calling WithDegreeOfParallelism() in PLINQ queries. Still, it is the scheduler that decides how many tasks actually run in parallel.
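For the PLINQ case, the method is spelled WithDegreeOfParallelism; a minimal sketch:

```csharp
using System.Linq;

class Demo
{
    static void Main()
    {
        // Cap the query at 4 concurrent tasks. The value is an
        // upper bound on parallelism, not a guarantee.
        int[] squares = Enumerable.Range(0, 100)
            .AsParallel()
            .WithDegreeOfParallelism(4)
            .Select(n => n * n)
            .ToArray();
    }
}
```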

There's a nice free book about multithreading in .NET.

Upvotes: 1

Jon Skeet

Reputation: 1503290

Without seeing the code or knowing what the tasks are really doing, all we can do is offer some fairly general advice and diagnostics.

Is your code CPU-bound or IO-bound? (You should be able to tell this by looking at Performance Monitor and seeing how busy your CPUs are while running the code.)

If your code is IO-bound, and if you've got multiple files on a single physical non-SSD drive, then putting the work in parallel may well be making it worse as you're forcing the drive head to keep dotting all over the place.

If your code is CPU-bound then parallelization should be helping (as these sound like independent tasks) - again, you should be able to tell this by running your code first without parallelization and then with parallelization, in both cases looking at the CPU graphs. You would expect that in the serial version, only one CPU would be "busy" at a time, whereas in the parallel version all the CPUs should be busy.
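A simple way to run that comparison is to time the same work serially and in parallel (a sketch; `doWork` and the fake file list stand in for the real per-file processing):

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

class Benchmark
{
    static void Main()
    {
        // Fake file list; doWork stands in for read-deserialize-write.
        string[] files = Enumerable.Range(0, 2000)
            .Select(i => "file" + i + ".xml")
            .ToArray();
        Action<string> doWork = f => { /* per-file processing placeholder */ };

        var sw = Stopwatch.StartNew();
        foreach (var f in files)
            doWork(f);
        Console.WriteLine("Serial:   " + sw.Elapsed);

        sw.Restart();
        Parallel.ForEach(files, doWork);
        Console.WriteLine("Parallel: " + sw.Elapsed);
    }
}
```

If the parallel run is not meaningfully faster and the CPUs are not saturated during it, that is a strong hint the workload is IO-bound.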

Upvotes: 3
