Next Door Engineer

Reputation: 2896

Reading a CSV file with a million rows in parallel in C#

I have a CSV file with over 1 million rows of data. I am planning to read them in parallel to improve efficiency. Can I do something like the following, or is there a more efficient method?

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using System.Windows.Forms;

namespace ParallelData
{
public partial class ParallelData : Form
{
    public ParallelData()
    {
        InitializeComponent();
    }

    private static readonly char[] Separators = { ',', ' ' };

    private static void ProcessFile()
    {
        var lines = File.ReadLines("BigData.csv");
        var numbers = ProcessRawNumbers(lines);

        var rowTotal = new List<double>();
        var totalElements = 0;

        foreach (var values in numbers)
        {
            var sumOfRow = values.Sum();
            rowTotal.Add(sumOfRow);
            totalElements += values.Count;
        }
        MessageBox.Show(totalElements.ToString());
    }

    private static List<List<double>> ProcessRawNumbers(IEnumerable<string> lines)
    {
        var numbers = new List<List<double>>();
        Parallel.ForEach(lines, line =>
        {
            lock (numbers)
            {
                numbers.Add(ProcessLine(line));
            }
        });
        return numbers;
    }

    private static List<double> ProcessLine(string line)
    {
        var list = new List<double>();
        foreach (var s in line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
        {
            double value;
            if (double.TryParse(s, out value))
            {
                list.Add(value);
            }
        }
        return list;
    }

    private void button2_Click(object sender, EventArgs e)
    {
        ProcessFile();
    }
}
}

Upvotes: 5

Views: 14289

Answers (3)

PawelZ

Reputation: 125

I checked this code on my machine, and it looks like using Parallel to read a CSV file without any CPU-expensive computation makes no sense: it takes more time to run in parallel than in a single thread. Here are my results. For the code above:

2699ms 2712ms (checked twice just to confirm the results)

Then with:

private static IEnumerable<List<double>> ProcessRawNumbers2(IEnumerable<string> lines)
{
    var numbers = new List<List<double>>();
    foreach (var line in lines)
    {
        lock (numbers)
        {
            numbers.Add(ProcessLine(line));
        }
    }
    return numbers;
}

Gives me: 2075ms 2106ms

So I think that if the numbers in the CSV don't need to be computed somehow (with some extensive calculation or so) and then stored in the program, it makes no sense to use parallelism in a case like this, as it just adds overhead.
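For reference, timings like these can be reproduced with a simple Stopwatch harness. A minimal sketch follows; the two ProcessFile* method names are placeholders standing in for the parallel and sequential variants above, not code from this answer:

using System;
using System.Diagnostics;

static class TimingHarness
{
    static void Main()
    {
        // Warm-up run so JIT compilation doesn't skew the first measurement.
        ProcessFileParallel();

        var sw = Stopwatch.StartNew();
        ProcessFileParallel();           // the Parallel.ForEach version
        sw.Stop();
        Console.WriteLine("Parallel:   {0}ms", sw.ElapsedMilliseconds);

        sw.Restart();
        ProcessFileSequential();         // the plain foreach version
        sw.Stop();
        Console.WriteLine("Sequential: {0}ms", sw.ElapsedMilliseconds);
    }

    // Hypothetical placeholders for calls to ProcessRawNumbers /
    // ProcessRawNumbers2 shown above.
    static void ProcessFileParallel() { }
    static void ProcessFileSequential() { }
}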

Upvotes: 0

Ragoczy

Reputation: 2937

In general you should try to avoid disk access from multiple threads. The disk is a bottleneck and will block, so parallel reads can actually hurt performance.

If the total size of the file is not an issue, you should probably read the entire file into memory first and then process it in parallel.

If the file is too large for that, or it's not practical, then you could use a BlockingCollection to load it: use one thread to read the file and populate the BlockingCollection, and then Parallel.ForEach to process the items in it. BlockingCollection lets you specify a maximum size for the collection, so the reading thread will only add more lines from the file as the lines already in the collection are processed and removed.

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

static void Main(string[] args)
{
    string filename = @"c:\vs\temp\test.txt";
    int maxEntries = 2;

    var c = new BlockingCollection<string>(maxEntries);

    var taskAdding = Task.Factory.StartNew(delegate
    {
        var lines = File.ReadLines(filename);
        foreach (var line in lines)
        {
            c.Add(line);    // when there are maxEntries items
                            // in the collection, this line
                            // and thread will block until
                            // the processing thread removes
                            // an item
        }

        c.CompleteAdding(); // this tells the collection there's
                            // nothing more to be added, so the
                            // enumerator in the other thread can
                            // end
    });

    while (c.Count < 1)
    {
        // this is here simply to give the adding thread time to
        // spin up in this much simplified sample
    }

    Parallel.ForEach(c.GetConsumingEnumerable(), i =>
    {
        // NOTE: GetConsumingEnumerable() removes items from the
        //   collection as it enumerates over it, this frees up
        //   the space in the collection for the other thread
        //   to write more lines from the file
        Console.WriteLine(i);
    });

    Console.ReadLine();
}

As with some of the others, though, I have to ask: is this something you really need to optimize through parallelization, or would a single-threaded solution perform well enough? Multithreading adds a lot of complexity, and it's sometimes not worth it.

What kind of performance are you seeing that you want to improve upon?

Upvotes: 0

ken2k

Reputation: 49013

I'm not sure it's a good idea. Depending on your hardware, the bottleneck won't be the CPU but the disk read speed.

Another point: if your storage hardware is a magnetic hard disk, then disk read speed is strongly related to how the file is physically stored on the disk; if the file is not fragmented (i.e. all file chunks are stored sequentially on the disk), you'll get better performance by reading line by line sequentially.

One solution would be to read the whole file at once (if you have enough memory; for 1 million rows it should be OK) using File.ReadAllLines, store all lines in a string array, and then process them (e.g. parse using string.Split, etc.) in a Parallel.ForEach, if the row order is not important. A minimal sketch of that approach is below.
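Here is one way that could look, reusing the parsing logic from the question (the file name and separators are assumptions carried over from it):

using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

class WholeFileParallel
{
    static readonly char[] Separators = { ',', ' ' };

    static void Main()
    {
        // Read the entire file into memory in a single sequential pass.
        string[] lines = File.ReadAllLines("BigData.csv");

        // Parse in parallel. Each iteration writes to its own slot of the
        // results array, so no lock is needed (and row order happens to be
        // preserved as a side effect, even when it doesn't matter).
        var results = new List<double>[lines.Length];
        Parallel.For(0, lines.Length, i =>
        {
            var values = new List<double>();
            foreach (var s in lines[i].Split(Separators, StringSplitOptions.RemoveEmptyEntries))
            {
                double d;
                if (double.TryParse(s, out d))
                {
                    values.Add(d);
                }
            }
            results[i] = values;
        });

        Console.WriteLine("Parsed {0} rows", results.Length);
    }
}

Writing each result to its own index of a preallocated array also avoids the lock contention of adding to a shared list inside Parallel.ForEach, as the question's code does.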

Upvotes: 13
