Reputation: 505
I'm trying to read an Excel document and write it out as a CSV.
I've figured out how to do this in several slightly different ways, but it is slow as hell.
This is what I have working; it runs through 2 sheets, each with 16384 rows and 5 columns of data, in about 1 min 36 seconds:
public void ToCSV(Stream excelStream, int i)
{
    // IExcelDataReader excelReader = null;
    using (var excelReader = ExcelReaderFactory.CreateReader(excelStream))
    {
        System.Data.DataSet excelsheets = excelReader.AsDataSet();
        foreach (DataTable aSheet in excelsheets.Tables)
        {
            string strCSVData = "";
            string sheetName = aSheet.TableName;
            foreach (DataRow row in aSheet.Rows)
            {
                foreach (var column in row.ItemArray)
                {
                    strCSVData += column.ToString().Replace(",", ",") + ",";
                }
                strCSVData += "\n";
            }
            string strOutputFileName = Directory.GetCurrentDirectory() + sheetName + i.ToString() + ".csv";
            StreamWriter csvFile = new StreamWriter(strOutputFileName, false);
            csvFile.Write(strCSVData);
            csvFile.Close();
        }
    }
}
Now I'm trying to speed it up. Using a normal for loop was slightly faster, but nothing spectacular, at about 1 min 33 seconds; roughly the shape of the sketch below.
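For reference, a minimal sketch of the shape of that for-loop variant (not the exact code I timed, and the method name ToCSVForLoop is just for illustration; it still builds the CSV by string concatenation, which is presumably why it was only marginally faster):
// Sketch of the indexed for-loop variant (assumed shape, not the exact timed code).
// requires: using System.Data; using System.IO; using ExcelDataReader;
public void ToCSVForLoop(Stream excelStream, int i)
{
    using (var excelReader = ExcelReaderFactory.CreateReader(excelStream))
    {
        System.Data.DataSet excelsheets = excelReader.AsDataSet();
        for (int sheet = 0; sheet < excelsheets.Tables.Count; sheet++)
        {
            DataTable aSheet = excelsheets.Tables[sheet];
            string sheetName = aSheet.TableName;
            string strCSVData = "";
            for (int r = 0; r < aSheet.Rows.Count; r++)
            {
                DataRow row = aSheet.Rows[r];
                for (int c = 0; c < aSheet.Columns.Count; c++)
                {
                    // Same per-cell formatting as the foreach version above.
                    strCSVData += row[c].ToString() + ",";
                }
                strCSVData += "\n";
            }
            string strOutputFileName = Directory.GetCurrentDirectory() + sheetName + i.ToString() + ".csv";
            File.WriteAllText(strOutputFileName, strCSVData);
        }
    }
}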
So I thought: how about using Parallel.ForEach instead? However, this resulted in either only a third of the data being written, or none at all.
This is how I changed the above method:
public void ToCSVParallel(Stream excelStream, int i)
{
    // IExcelDataReader excelReader = null;
    using (var excelReader = ExcelReaderFactory.CreateReader(excelStream))
    {
        System.Data.DataSet excelsheets = excelReader.AsDataSet();
        for (int sheet = 0; sheet < excelsheets.Tables.Count; sheet++)
        {
            DataTable aSheet = excelsheets.Tables[sheet];
            List<string> strCSVData = new List<string>();
            string sheetName = aSheet.TableName;
            IEnumerable<DataRow> dataSheet = aSheet.AsEnumerable();
            Parallel.ForEach<DataRow>(dataSheet, row =>
            {
                string strRow = "";
                for (int column = 0; column < row.ItemArray.Count(); column++)
                {
                    strRow = row[column].ToString().Replace(",", ",") + ",";
                }
                strRow += "\n";
                strCSVData.Append(strRow);
            });
            string strOutputFileName = Directory.GetCurrentDirectory() + sheetName + i.ToString() + ".csv";
            //StreamWriter csvFile = new StreamWriter(strOutputFileName, false);
            System.IO.File.WriteAllLines(strOutputFileName, strCSVData);
            // csvFile.Write(strCSVData);
            //csvFile.Close();
        }
    }
}
Now I have no idea what I'm doing wrong, but I'm fairly sure I must be misunderstanding how to use Parallel.ForEach. What am I doing wrong?
Or is there a better/smarter/simpler way to speed up my method?
EDIT:
Based on all of your suggestions, I've come up with the following changes:
public void ToCSVParallel(Stream excelStream, int i)
{
    using (var excelReader = ExcelReaderFactory.CreateReader(excelStream))
    {
        System.Data.DataSet excelsheets = excelReader.AsDataSet();
        for (int sheet = 0; sheet < excelsheets.Tables.Count; sheet++)
        {
            DataTable aSheet = excelsheets.Tables[sheet];
            ConcurrentBag<string> strCSVData = new ConcurrentBag<string>();
            string sheetName = aSheet.TableName;
            IEnumerable<DataRow> dataSheet = aSheet.AsEnumerable();
            Parallel.ForEach<DataRow>(dataSheet, row =>
            {
                StringBuilder strRow = new StringBuilder();
                for (int column = 0; column < row.ItemArray.Count(); column++)
                {
                    strRow.Append(row[column].ToString().Replace(",", ",") + ",");
                }
                strCSVData.Add(strRow.ToString());
            });
            string strOutputFileName = Directory.GetCurrentDirectory() + sheetName + i.ToString() + ".csv";
            System.IO.File.WriteAllLines(strOutputFileName, strCSVData);
        }
    }
}
However, based on @Magnus's suggestion, I also changed my original method to this:
public void ToCSV(Stream excelStream, int i)
{
    using (var excelReader = ExcelReaderFactory.CreateReader(excelStream))
    {
        System.Data.DataSet excelsheets = excelReader.AsDataSet();
        foreach (DataTable aSheet in excelsheets.Tables)
        {
            string sheetName = aSheet.TableName;
            string strOutputFileName = Directory.GetCurrentDirectory() + sheetName + i.ToString() + ".csv";
            using (StreamWriter csvFile = new StreamWriter(strOutputFileName, false))
            {
                foreach (DataRow row in aSheet.Rows)
                {
                    foreach (var column in row.ItemArray)
                    {
                        csvFile.Write(column.ToString().Replace(",", ",") + ",");
                    }
                    csvFile.WriteLine();
                }
            }
        }
    }
}
And the result is astonishing to me:
the parallel version is on average about 1000 milliseconds slower than the modified foreach loops.
However, my attempt to make the method faster is now within acceptable levels: the parallel version takes on average about 8800 milliseconds, while the foreach loop takes on average about 7600 milliseconds. Both of those are on 2 sheets, each with 16384 rows and 5 columns of data.
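For completeness, this is the kind of Stopwatch harness that can produce such per-run timings (a sketch only; TimeConversion and the "Book1.xlsx" path are illustrative names, not my actual test setup):
// Minimal timing harness sketch (illustrative only).
// requires: using System; using System.Diagnostics; using System.IO;
public static void TimeConversion(Action<Stream, int> convert, string path, int runId)
{
    using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read))
    {
        Stopwatch sw = Stopwatch.StartNew();
        convert(fs, runId);
        sw.Stop();
        Console.WriteLine(convert.Method.Name + ": " + sw.ElapsedMilliseconds + " ms");
    }
}
// e.g. TimeConversion(converter.ToCSV, "Book1.xlsx", 0);
//      TimeConversion(converter.ToCSVParallel, "Book1.xlsx", 1);
// where converter is an instance of the class containing the methods above.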
Upvotes: 0
Views: 406
Reputation: 46997
A couple of issues with your code.
strCSVData.Append(strRow) doesn't actually add anything to the list; it returns a new enumerable with the item appended. Add wouldn't work either, since List is not thread safe.
I think the main issue with your original (non-parallel) loop is building strCSVData by concatenating. Since strings are immutable, a new string has to be created every time, and the larger it gets, the slower it becomes. I would suggest that you open the StreamWriter before the loop and write directly to that one instead.
...
StreamWriter csvFile = new StreamWriter(strOutputFileName, false);
for (int sheet = 0; sheet < excelsheets.Tables.Count; sheet++)
{
    ...
    foreach (DataRow row in aSheet.Rows)
    {
        foreach (var column in row.ItemArray)
        {
            csvFile.Write(column.ToString().Replace(",", ",") + ",");
        }
        csvFile.WriteLine();
    }
    ...
}
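To make the Append point concrete, here is a small sketch (illustrative values only): Enumerable.Append returns a new sequence and leaves the original list untouched, whereas ConcurrentBag.Add is safe to call from the parallel loop (though it does not preserve row order).
// Sketch only: why Append never filled the list, and a thread-safe alternative.
// requires: using System; using System.Collections.Concurrent; using System.Collections.Generic; using System.Linq; using System.Threading.Tasks;
var list = new List<string>();
var ignored = list.Append("row 1");   // LINQ Enumerable.Append returns a new sequence
Console.WriteLine(list.Count);        // 0 - the original list is unchanged

var bag = new ConcurrentBag<string>();
Parallel.ForEach(Enumerable.Range(0, 100), n =>
{
    bag.Add("row " + n);              // ConcurrentBag.Add is thread safe
});
Console.WriteLine(bag.Count);         // 100, but insertion order is not preserved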
Upvotes: 2