Reputation: 698
I have been given a very large Excel workbook that contains 500+ sheets. Each sheet represents a store, and the rows contain transactions at that store. Each sheet's layout is identical. I have been asked to write a program that loops through each sheet, pulls specific transaction data, and writes everything to one gigantic CSV file.
I know that this kind of functionality is far better suited to a relational database, but I have been asked to work on this as-is.
I have written a program that successfully parses the data and writes it. The problem is that it takes almost a half-hour to complete the file write when reading and writing data synchronously.
I would like to accomplish this task by reading and writing the data from each sheet asynchronously. In C#, I would prefer to use the Task Parallel Library for this, but I am open to other options.
I am thinking about spinning off the worker threads from a foreach loop, like so:
foreach (Worksheet ws in _excelApp.Worksheets)
{
    Parallel.Invoke(() => ExportWorksheet(ws));
}
And then in the method (shortened for brevity):
private void ExportWorksheet(Worksheet ws)
{
    using (FileStream fs = new FileStream(fi.FullName, FileMode.Append, FileAccess.Write, FileShare.Write, 1, true))
    {
        for (int row = 1; row < 300; row++)
        {
            for (int column = 1; column < 20; column++)
            {
                byte[] bytes = Encoding.ASCII.GetBytes(ws.Cells[row, column].Value.ToString() + ",");
                fs.Write(bytes, 0, bytes.Length);
            }
            byte[] newline = Encoding.ASCII.GetBytes("\n");
            fs.Write(newline, 0, newline.Length);
        }
    }
}
This gives me strange results, of course.
Am I on the right track? Should I be using a different encoding? Is there a cleaner way to accomplish the async write? Are there any threading rules being broken here?
All suggestions are welcome. Thanks for the help.
Upvotes: 1
Views: 2721
Reputation: 351
My experience with reading Excel from C# is, generally, nasty. All your computing time is spent trafficking with Excel - writing out CSV files takes no time at all. It isn't worth bothering with the separate threads.
In some cases I simply saved the spreadsheet as .csv and then parsed it from there. How this works with multiple sheets I don't know, but you might be able to page through the sheets, saving them to .csv files one by one (see the sketch below). Then, read the .csv files as long strings and clean them up.
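A minimal sketch of that per-sheet save, assuming the standard Microsoft.Office.Interop.Excel API (the output folder is made up for illustration):
using Microsoft.Office.Interop.Excel;

foreach (Worksheet ws in _excelApp.Worksheets)
{
    // The CSV format holds only a single sheet, so each SaveAs yields one file per sheet.
    ws.SaveAs(@"C:\temp\" + ws.Name + ".csv", XlFileFormat.xlCSV);
}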
Upvotes: 0
Reputation: 157098
Instead of looping through the rows and columns cell by cell, you'd better use the Value property of a range (for example the UsedRange of a Worksheet). It returns a two-dimensional array containing all the data, which increases reading performance by a factor of 1000.
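For example, a minimal sketch of that bulk read, assuming the interop types from the question (note that the array interop returns is 1-based):
// One COM call pulls the whole sheet into memory as a 1-based object[,].
object[,] data = (object[,])ws.UsedRange.Value;

for (int row = 1; row <= data.GetUpperBound(0); row++)
{
    for (int column = 1; column <= data.GetUpperBound(1); column++)
    {
        object cell = data[row, column];                   // null for empty cells
        string text = cell == null ? "" : cell.ToString(); // convert in .NET, not via COM
        // ... append text to the CSV line here
    }
}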
As for the writing part: I rewrote it in two variants, removing the Excel references:
DateTime start = DateTime.Now;

//using (FileStream fs = new FileStream(@"C:\temp\x.x", FileMode.Append, FileAccess.Write, FileShare.Write, 1, true))
//{
//    for (int row = 1; row < 3 * 1000; row++)
//    {
//        for (int column = 1; column < 3 * 1000; column++)
//        {
//            byte[] bytes = Encoding.ASCII.GetBytes(1.ToString() + ",");
//            fs.Write(bytes, 0, bytes.Length);
//        }
//        byte[] bytes2 = Encoding.ASCII.GetBytes("\n");
//        fs.Write(bytes2, 0, bytes2.Length);
//    }
//}

using (TextWriter tw = new StreamWriter(new FileStream(@"C:\temp\x.x", FileMode.Append, FileAccess.Write, FileShare.Write, 1, true)))
{
    for (int row = 1; row < 3 * 1000; row++)
    {
        for (int column = 1; column < 3 * 1000; column++)
        {
            tw.Write(1.ToString());
            tw.Write(',');
        }
        tw.WriteLine();
    }
}

DateTime end = DateTime.Now;
MessageBox.Show(string.Format("Time spent: {0:N0} ms.", (end - start).TotalMilliseconds));
The first part (which is almost identical to your code, now commented out) takes 3,670 seconds (yes, over three thousand). The second part (not commented out) takes 12 seconds. The difference is buffering: the FileStream is opened with a one-byte buffer, so every Write call goes straight to the operating system, while the StreamWriter collects output in its own buffer and hands the stream much larger chunks.
Upvotes: 2