Saidur Rahman

Reputation: 422

Efficient way to split CSV files in C#

I am trying to split a large telecom bill, which comes as a 300MB CSV file, into smaller chunks based on the phone number in the bill.

Some phone numbers have bills of 20 lines and some have more than 1000 lines, so the size varies. On a first pass I read the bill and use LINQ to group the lines by phone number and count how many lines the CSV file contains for each one. Then I insert into a List: split_id, starting line, ending line (line numbering starts from 0).

The code below is what I use to split smaller bills. But this 300MB file has an unusually large number of phone numbers (7500+), and even though each output file comes to under 100KB, splitting the bill takes forever.

    static void FileSplitWriter(List<SplitFile> pList, string info)
    {

        pList.ForEach(delegate(SplitFile per)
        {
            int startingLine = per.startingLine;
            int endingLine = per.endingLine;
            string[] fileContents = File.ReadAllLines(info);
            var query = fileContents.Skip(startingLine - 1).Take(endingLine - (startingLine - 1));
            string directoryPath = Path.GetDirectoryName(info);
            string filenameok = Path.GetFileNameWithoutExtension(info);

            StreamWriter ffs = new StreamWriter(directoryPath + "\\" + filenameok + "_split" + per.id + ".csv");
            foreach (string line in query)
            {
                ffs.WriteLine(line);
            }
            ffs.Dispose();
            ffs.Close();
        });


    }

My question is: is it possible to make this process much faster/more efficient? At the current rate it will take 3 hours or so just to split the file.

Upvotes: 2

Views: 10579

Answers (3)

Sergey Kalinichenko

Reputation: 726569

Try moving the read of the file to outside the loop:

    static void FileSplitWriter(List<SplitFile> pList, string info) {
        // Read the source file once, not once per split.
        string[] fileContents = File.ReadAllLines(info);
        string directoryPath = Path.GetDirectoryName(info);
        string filenameok = Path.GetFileNameWithoutExtension(info);
        pList.ForEach(delegate(SplitFile per) {
            int startingLine = per.startingLine;
            int endingLine = per.endingLine;
            var query = fileContents.Skip(startingLine - 1).Take(endingLine - (startingLine - 1));
            // "using" closes and disposes the writer, even if a write throws.
            using (StreamWriter ffs = new StreamWriter(directoryPath + "\\" + filenameok + "_split" + per.id + ".csv")) {
                foreach (string line in query) {
                    ffs.WriteLine(line);
                }
            }
        });
    }
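
Once the file is read only once, the remaining per-split cost is likely the Skip/Take call, which on older .NET versions may walk the array from the beginning for every split. A minimal sketch of writing a range by index instead is shown here. The class and method names (RangeWriter, WriteRange) are hypothetical and not part of the original code. The clamping is meant to mirror how Skip and Take treat out-of-range values.

```csharp
using System;
using System.IO;

static class RangeWriter
{
    // Hypothetical helper: writes `count` lines starting at zero-based index `first`.
    // Out-of-range values are clamped, similar to Skip(first).Take(count).
    public static int WriteRange(string[] lines, int first, int count, string outputPath)
    {
        int start = Math.Max(first, 0);
        int end = Math.Min(start + count, lines.Length); // exclusive upper bound
        int written = 0;

        using (var writer = new StreamWriter(outputPath))
        {
            // Index directly into the array instead of enumerating from line 0.
            for (int i = start; i < end; i++)
            {
                writer.WriteLine(lines[i]);
                written++;
            }
        }
        return written;
    }
}
```

Inside the ForEach, this could be called as WriteRange(fileContents, startingLine - 1, endingLine - (startingLine - 1), path), which keeps the same line range as the Skip/Take version.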

Upvotes: 2

Peter

Reputation: 12711

It looks like the most inefficient part of this code is that you are reading the entire 300MB file into memory multiple times. You should only need to read it once ...

  1. Read the file into some enumerable data structure.
  2. Group by phone number.
  3. Loop over each group and write each to a file.

Note: if you're using .NET 4.0 or later, you might gain some memory efficiency by using File.ReadLines() instead of ReadAllLines(), since it streams lines lazily rather than loading the whole file into an array.
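
The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the phone number is taken to be the first comma-separated field, fields are assumed to contain no quoted commas, and the file is assumed to have no header row. None of that is confirmed by the question. The names BillSplitter, PhoneNumberOf and SplitByPhoneNumber are hypothetical.

```csharp
using System;
using System.IO;
using System.Linq;

static class BillSplitter
{
    // Assumption: the phone number is the first comma-separated field.
    static string PhoneNumberOf(string line)
    {
        int comma = line.IndexOf(',');
        return comma < 0 ? line : line.Substring(0, comma);
    }

    // Reads the input once, groups lines by phone number, and writes one file per group.
    // Returns the number of files written.
    public static int SplitByPhoneNumber(string inputPath, string outputDirectory)
    {
        string baseName = Path.GetFileNameWithoutExtension(inputPath);

        // File.ReadLines streams the file; ToLookup consumes it in a single pass
        // and keeps the grouped lines in memory.
        var groups = File.ReadLines(inputPath).ToLookup(PhoneNumberOf);

        int id = 0;
        foreach (var group in groups)
        {
            string outputPath = Path.Combine(outputDirectory, baseName + "_split" + id + ".csv");
            File.WriteAllLines(outputPath, group); // lines stay in their original order
            id++;
        }
        return id;
    }
}
```

Grouping in memory still holds roughly the whole file at once, but the input is read a single time and each output file is written in one call, so the separate counting pass and the list of start/end lines are no longer needed.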

Upvotes: 3

Oded

Reputation: 499002

I suggest you use one of the many fast CSV parsing libraries that exist.

There are several posted on CodeProject and elsewhere, as well as FileHelpers.

Upvotes: 2
