Reputation: 505
I'm trying to read an Excel document and write it out as a CSV.
I've figured out how to do this in several slightly different ways, but it is slow as hell.
This is what I have working; it runs through 2 sheets, each with 16384 rows and 5 columns of data, in about 1 min 36 seconds:
public void ToCSV(Stream excelStream, int i)
{
    // IExcelDataReader excelReader = null;
    using (var excelReader = ExcelReaderFactory.CreateReader(excelStream))
    {
        System.Data.DataSet excelsheets = excelReader.AsDataSet();
        foreach (DataTable aSheet in excelsheets.Tables)
        {
            string strCSVData = "";
            string sheetName = aSheet.TableName;
            foreach (DataRow row in aSheet.Rows)
            {
                foreach (var column in row.ItemArray)
                {
                    strCSVData += column.ToString().Replace(",", ",") + ",";
                }
                strCSVData += "\n";
            }
            string strOutputFileName = Directory.GetCurrentDirectory() + sheetName + i.ToString() + ".csv";
            StreamWriter csvFile = new StreamWriter(strOutputFileName, false);
            csvFile.Write(strCSVData);
            csvFile.Close();
        }
    }
}
Now I'm trying to speed it up. Using a normal for loop was slightly faster, but nothing spectacular, at about 1 min 33 seconds; roughly the shape of the sketch below.
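For reference, a minimal sketch of the shape of that for-loop variant (not the exact code I timed, and the method name ToCSVForLoop is just for illustration; it still builds the CSV by string concatenation, which is presumably why it was only marginally faster):
// Sketch of the indexed for-loop variant (assumed shape, not the exact timed code).
// requires: using System.Data; using System.IO; using ExcelDataReader;
public void ToCSVForLoop(Stream excelStream, int i)
{
    using (var excelReader = ExcelReaderFactory.CreateReader(excelStream))
    {
        System.Data.DataSet excelsheets = excelReader.AsDataSet();
        for (int sheet = 0; sheet < excelsheets.Tables.Count; sheet++)
        {
            DataTable aSheet = excelsheets.Tables[sheet];
            string sheetName = aSheet.TableName;
            string strCSVData = "";
            for (int r = 0; r < aSheet.Rows.Count; r++)
            {
                DataRow row = aSheet.Rows[r];
                for (int c = 0; c < aSheet.Columns.Count; c++)
                {
                    // Same per-cell formatting as the foreach version above.
                    strCSVData += row[c].ToString() + ",";
                }
                strCSVData += "\n";
            }
            string strOutputFileName = Directory.GetCurrentDirectory() + sheetName + i.ToString() + ".csv";
            File.WriteAllText(strOutputFileName, strCSVData);
        }
    }
}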
So I thought: how about using Parallel.ForEach instead? However, this resulted in either only a third of the data being written, or none at all.
This is how I changed the above method:
public void ToCSVParallel(Stream excelStream, int i)
{
    // IExcelDataReader excelReader = null;
    using (var excelReader = ExcelReaderFactory.CreateReader(excelStream))
    {
        System.Data.DataSet excelsheets = excelReader.AsDataSet();
        for (int sheet = 0; sheet < excelsheets.Tables.Count; sheet++)
        {
            DataTable aSheet = excelsheets.Tables[sheet];
            List<string> strCSVData = new List<string>();
            string sheetName = aSheet.TableName;
            IEnumerable<DataRow> dataSheet = aSheet.AsEnumerable();
            Parallel.ForEach<DataRow>(dataSheet, row =>
            {
                string strRow = "";
                for (int column = 0; column < row.ItemArray.Count(); column++)
                {
                    strRow = row[column].ToString().Replace(",", ",") + ",";
                }
                strRow += "\n";
                strCSVData.Append(strRow);
            });
            string strOutputFileName = Directory.GetCurrentDirectory() + sheetName + i.ToString() + ".csv";
            //StreamWriter csvFile = new StreamWriter(strOutputFileName, false);
            System.IO.File.WriteAllLines(strOutputFileName, strCSVData);
            // csvFile.Write(strCSVData);
            //csvFile.Close();
        }
    }
}
Now I have no idea what I'm doing wrong, but I'm fairly sure I must be misunderstanding how to use Parallel.ForEach. What am I doing wrong?
Or is there a better/smarter/simpler way to speed up my method?
EDIT:
Based on all of your suggestions, I've come up with the following changes:
public void ToCSVParallel(Stream excelStream, int i)
{
    using (var excelReader = ExcelReaderFactory.CreateReader(excelStream))
    {
        System.Data.DataSet excelsheets = excelReader.AsDataSet();
        for (int sheet = 0; sheet < excelsheets.Tables.Count; sheet++)
        {
            DataTable aSheet = excelsheets.Tables[sheet];
            ConcurrentBag<string> strCSVData = new ConcurrentBag<string>();
            string sheetName = aSheet.TableName;
            IEnumerable<DataRow> dataSheet = aSheet.AsEnumerable();
            Parallel.ForEach<DataRow>(dataSheet, row =>
            {
                StringBuilder strRow = new StringBuilder();
                for (int column = 0; column < row.ItemArray.Count(); column++)
                {
                    strRow.Append(row[column].ToString().Replace(",", ",") + ",");
                }
                strCSVData.Add(strRow.ToString());
            });
            string strOutputFileName = Directory.GetCurrentDirectory() + sheetName + i.ToString() + ".csv";
            System.IO.File.WriteAllLines(strOutputFileName, strCSVData);
        }
    }
}
However, based on @Magnus's suggestion, I also changed my original method to this:
public void ToCSV(Stream excelStream, int i)
{
    using (var excelReader = ExcelReaderFactory.CreateReader(excelStream))
    {
        System.Data.DataSet excelsheets = excelReader.AsDataSet();
        foreach (DataTable aSheet in excelsheets.Tables)
        {
            string sheetName = aSheet.TableName;
            string strOutputFileName = Directory.GetCurrentDirectory() + sheetName + i.ToString() + ".csv";
            using (StreamWriter csvFile = new StreamWriter(strOutputFileName, false))
            {
                foreach (DataRow row in aSheet.Rows)
                {
                    foreach (var column in row.ItemArray)
                    {
                        csvFile.Write(column.ToString().Replace(",", ",") + ",");
                    }
                    csvFile.WriteLine();
                }
            }
        }
    }
}
And the result is astonishing to me:
the parallel version is on average about 1000 milliseconds slower than the modified foreach loops.
However, my attempt to make the method faster is now within acceptable levels: the parallel version takes on average about 8800 milliseconds, while the foreach loop takes on average about 7600 milliseconds. Both of those are on 2 sheets, each with 16384 rows and 5 columns of data.
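For completeness, this is the kind of Stopwatch harness that can produce such per-run timings (a sketch only; TimeConversion and the "Book1.xlsx" path are illustrative names, not my actual test setup):
// Minimal timing harness sketch (illustrative only).
// requires: using System; using System.Diagnostics; using System.IO;
public static void TimeConversion(Action<Stream, int> convert, string path, int runId)
{
    using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read))
    {
        Stopwatch sw = Stopwatch.StartNew();
        convert(fs, runId);
        sw.Stop();
        Console.WriteLine(convert.Method.Name + ": " + sw.ElapsedMilliseconds + " ms");
    }
}
// e.g. TimeConversion(converter.ToCSV, "Book1.xlsx", 0);
//      TimeConversion(converter.ToCSVParallel, "Book1.xlsx", 1);
// where converter is an instance of the class containing the methods above.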
Upvotes: 0
Views: 406
Reputation: 46997
A couple of issues with your code.
strCSVData.Append(strRow) doesn't actually add anything to the list; it returns a new enumerable with the item appended. Add wouldn't work either, since List is not thread safe.
I think the main issue with your original (non-parallel) loop is building strCSVData by concatenating. Since strings are immutable, a new string has to be created every time, and the larger it gets, the slower it becomes. I would suggest that you open the StreamWriter before the loop and write directly to that one instead.
...
StreamWriter csvFile = new StreamWriter(strOutputFileName, false);
for (int sheet = 0; sheet < excelsheets.Tables.Count; sheet++)
{
    ...
    foreach (DataRow row in aSheet.Rows)
    {
        foreach (var column in row.ItemArray)
        {
            csvFile.Write(column.ToString().Replace(",", ",") + ",");
        }
        csvFile.WriteLine();
    }
    ...
}
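To make the Append point concrete, here is a small sketch (illustrative values only): Enumerable.Append returns a new sequence and leaves the original list untouched, whereas ConcurrentBag.Add is safe to call from the parallel loop (though it does not preserve row order).
// Sketch only: why Append never filled the list, and a thread-safe alternative.
// requires: using System; using System.Collections.Concurrent; using System.Collections.Generic; using System.Linq; using System.Threading.Tasks;
var list = new List<string>();
var ignored = list.Append("row 1");   // LINQ Enumerable.Append returns a new sequence
Console.WriteLine(list.Count);        // 0 - the original list is unchanged

var bag = new ConcurrentBag<string>();
Parallel.ForEach(Enumerable.Range(0, 100), n =>
{
    bag.Add("row " + n);              // ConcurrentBag.Add is thread safe
});
Console.WriteLine(bag.Count);         // 100, but insertion order is not preserved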
Upvotes: 2