field_b
field_b

Reputation: 698

How to read Excel worksheets and write to a file asynchronously?

I have been given a very large Excel workbook that contains 500+ sheets. Each sheet represents a store, and the rows contain transactions at that store. Each sheet layout is identical. I have been requested to write a program that loops through each sheet, pulls specific transaction data, and writes everything to one gigantic CSV file.

I know that this kind of functionality is far better suited for relational database, but I have been asked to work on this as is.

I have written a program that successfully parses the data and writes it. The problem is that it takes almost a half-hour to complete the file write when reading and writing data synchronously.

I would like to accomplish this task by reading and writing the data from each sheet asynchronously. In C#, I would prefer to use the Task Parallel library for this, but am open to other options.

I am thinking about spinning off the worker threads from a foreach loop, like so:

foreach( Worksheet ws in _excelApp.Worksheets)
{
    Parallel.Invoke(()=>ExportWorksheet(ws));
}

And then in the method (shortened for brevity):

private void ExportWorksheet(Worksheet ws)
{         
     using(FileStream fs = new new FileStream(fi.FullName, FileMode.Append, FileAccess.Write, FileShare.Write, 1, true))
     {
         for(int row = 1; row < 300; row++)
         {
              for(int column = 1; column < 20)
              {
                   byte[] bytes = Encoding.ASCII.GetBytes(ws.Cells[row, column].Value.ToString() + ",");
                   fs.Write(bytes, 0, bytes.count());
              }

              fs.Write(Encoding.ASCII.GetBytes("\n"), 0, 2);
         } 
     }

}

This gives me strange results, of course.

Am I on the right track? Should I be using a different encoding? Is there a cleaner way to accomplish the async write? Are there any threading rules being broken here?

All suggestions are welcome. Thanks for the help.

Upvotes: 1

Views: 2721

Answers (2)

Meredith Poor
Meredith Poor

Reputation: 351

My experience with reading Excel from C# is, generally, nasty. All your computing time is spent trafficking with Excel - writing out CSV files takes no time at all. It isn't worth bothering with the separate threads.

In some cases I simply saved the spreadsheet as .csv and then parsed it from there. How this works from multiple sheets I don't know, but you might be able to page through the sheets saving them to .CSVs one by one. Then, read the .CSVs as long strings and clean them up.

Upvotes: 0

Patrick Hofman
Patrick Hofman

Reputation: 157098

Instead of looping through the rows and columns you'd better use the Value property of a range (for example the ActiveRange of a WorkSheet). This contains a two dimensional array containing all the data. This increases reading performance with a factor 1000.

For the other part. I rewrote it in two parts, removing the Excel references:

        DateTime start = DateTime.Now;

        //using (FileStream fs = new FileStream(@"C:\temp\x.x", FileMode.Append, FileAccess.Write, FileShare.Write, 1, true))
        //{
        //    for (int row = 1; row < 3 * 1000; row++)
        //    {
        //        for (int column = 1; column < 3 * 1000; column++)
        //        {
        //            byte[] bytes = Encoding.ASCII.GetBytes(1.ToString() + ",");
        //            fs.Write(bytes, 0, bytes.Length);
        //        }

        //        byte[] bytes2 = Encoding.ASCII.GetBytes("\n");
        //        fs.Write(bytes2, 0, bytes2.Length);
        //    }
        //}

        using (TextWriter tw = new StreamWriter(new FileStream(@"C:\temp\x.x", FileMode.Append, FileAccess.Write, FileShare.Write, 1, true)))
        {
            for (int row = 1; row < 3 * 1000; row++)
            {
                for (int column = 1; column < 3 * 1000; column++)
                {
                    tw.Write(1.ToString());
                    tw.Write(',');
                }

                tw.WriteLine();
            }
        }

        DateTime end = DateTime.Now;

        MessageBox.Show(string.Format("Time spent: {0:N0} ms.", (end - start).TotalMilliseconds));

The first part (which is almost identical to your code, now commented out) takes 3.670 (yes, over three thousand) seconds. The second part (not commented out) takes 12 seconds.

Upvotes: 2

Related Questions