Craig
Craig

Reputation: 41

reading a large open xml spreadsheet

i need to read (and parse) large spreadsheet files (20-50MB) using the openxml libraries and there doesn't seem to be a way to stream the rows one at a time for parsing.

i'm consistently getting Out Of Memory exceptions as it seems as soon as i attempt to access a row (or iterate) the entire row contents are loaded (100K+ rows).

each of the calls, whether Elements.Where( with query ) or Descendants ( ) seem to load the entire rowset

is there a way to stream or just read a row at a time ?

thx

Upvotes: 0

Views: 3928

Answers (3)

Jasmin Akther Suma
Jasmin Akther Suma

Reputation: 648

Here is the code to read large excel file with multiple sheets using SAX approach:

public static DataTable ReadIntoDatatableFromExcel(string newFilePath)
        {
            /*Creating a table with 20 columns*/
            var dt = CreateProviderRvenueSharingTable();

            try
            {
                /*using stream so that if excel file is in another process then it can read without error*/
                using (Stream stream = new FileStream(newFilePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
                {
                    using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(stream, false))
                    {
                        var workbookPart = spreadsheetDocument.WorkbookPart;
                        var workbook = workbookPart.Workbook;

                        /*get only unhide tabs*/
                        var sheets = workbook.Descendants<Sheet>().Where(e => e.State == null);

                        foreach (var sheet in sheets)
                        {
                            var worksheetPart = (WorksheetPart)workbookPart.GetPartById(sheet.Id);

                            /*Remove empty sheets*/
                            List<Row> rows = worksheetPart.Worksheet.Elements<SheetData>().First().Elements<Row>()
                                .Where(r => r.InnerText != string.Empty).ToList();

                            if (rows.Count > 1)
                            {
                                OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);

                                int i = 0;
                                int BTR = 0;/*Break the reader while empty rows are found*/

                                while (reader.Read())
                                {
                                    if (reader.ElementType == typeof(Row))
                                    {
                                        /*ignoring first row with headers and check if data is there after header*/
                                        if (i < 2)
                                        {
                                            i++;
                                            continue;
                                        }

                                        reader.ReadFirstChild();

                                        DataRow row = dt.NewRow();

                                        int CN = 0;

                                        if (reader.ElementType == typeof(Cell))
                                        {
                                            do
                                            {
                                                Cell c = (Cell)reader.LoadCurrentElement();

                                                /*reader skipping blank cells so data is getting worng in datatable's rows according to header*/
                                                if (CN != 0)
                                                {
                                                    int cellColumnIndex =
                                                        ExcelHelper.GetColumnIndexFromName(
                                                            ExcelHelper.GetColumnName(c.CellReference));

                                                    if (cellColumnIndex < 20 && CN < cellColumnIndex - 1)
                                                    {
                                                        do
                                                        {
                                                            row[CN] = string.Empty;
                                                            CN++;
                                                        } while (CN < cellColumnIndex - 1);
                                                    }
                                                }

                                                /*stopping execution if first cell does not have any value which means empty row*/
                                                if (CN == 0 && c.DataType == null && c.CellValue == null)
                                                {
                                                    BTR++;
                                                    break;
                                                }

                                                string cellValue = GetCellValue(c, workbookPart);
                                                row[CN] = cellValue;
                                                CN++;

                                                /*if any text exists after T column (index 20) then skip the reader*/
                                                if (CN == 20)
                                                {
                                                    break;
                                                }
                                            } while (reader.ReadNextSibling());
                                        }

                                        /*reader skipping blank cells so fill the array upto 19 index*/
                                        while (CN != 0 && CN < 20)
                                        {
                                            row[CN] = string.Empty;
                                            CN++;
                                        }

                                        if (CN == 20)
                                        {
                                            dt.Rows.Add(row);
                                        }
                                    }
                                    /*escaping empty rows below data filled rows after checking 5 times */
                                    if (BTR > 5)
                                        break;
                                }
                                reader.Close();
                            }                            
                        }
                    }
                }
            }
            catch (Exception ex)
            {
                throw ex;
            }
            return dt;
        }

  private static string GetCellValue(Cell c, WorkbookPart workbookPart)
        {
            string cellValue = string.Empty;
            if (c.DataType != null && c.DataType == CellValues.SharedString)
            {
                SharedStringItem ssi =
                    workbookPart.SharedStringTablePart.SharedStringTable
                        .Elements<SharedStringItem>()
                        .ElementAt(int.Parse(c.CellValue.InnerText));
                if (ssi.Text != null)
                {
                    cellValue = ssi.Text.Text;
                }
            }
            else
            {
                if (c.CellValue != null)
                {
                    cellValue = c.CellValue.InnerText;
                }
            }
            return cellValue;
        }

public static int GetColumnIndexFromName(string columnNameOrCellReference)
        {
            int columnIndex = 0;
            int factor = 1;
            for (int pos = columnNameOrCellReference.Length - 1; pos >= 0; pos--)   // R to L
            {
                if (Char.IsLetter(columnNameOrCellReference[pos]))  // for letters (columnName)
                {
                    columnIndex += factor * ((columnNameOrCellReference[pos] - 'A') + 1);
                    factor *= 26;
                }
            }
            return columnIndex;
        }

        public static string GetColumnName(string cellReference)
        {
            /* Advance from L to R until a number, then return 0 through previous position*/
            for (int lastCharPos = 0; lastCharPos <= 3; lastCharPos++)
                if (Char.IsNumber(cellReference[lastCharPos]))
                    return cellReference.Substring(0, lastCharPos);

            throw new ArgumentOutOfRangeException("cellReference");
        }

private static DataTable CreateProviderRvenueSharingTable()
    {
        DataTable dt = new DataTable("RevenueSharingTransaction");

        // Create fields
        dt.Columns.Add("IMId", typeof(string));
        dt.Columns.Add("InternalPlanId", typeof(string));
        dt.Columns.Add("PaymentReceivedDate", typeof(string));
        dt.Columns.Add("PaymentAmount", typeof(string));
        dt.Columns.Add("BPS", typeof(string));
        dt.Columns.Add("Asset", typeof(string));
        dt.Columns.Add("PaymentType", typeof(string));
        dt.Columns.Add("InvestmentManager", typeof(string));
        dt.Columns.Add("Frequency", typeof(string));
        dt.Columns.Add("StartDateForPayment", typeof(string));
        dt.Columns.Add("EndDateForPayment", typeof(string));
        dt.Columns.Add("Participant", typeof(string));
        dt.Columns.Add("SSN", typeof(string));
        dt.Columns.Add("JEDate", typeof(string));
        dt.Columns.Add("GL", typeof(string));
        dt.Columns.Add("JEDescription", typeof(string));
        dt.Columns.Add("CRAccount", typeof(string));
        dt.Columns.Add("ReportName", typeof(string));
        dt.Columns.Add("ReportLocation", typeof(string));
        dt.Columns.Add("Division", typeof(string));
        return dt;
    }

Code works for: 1. read the sheets from first in ascending order 2. if excel file is being used by another process, OpenXML still reads that. 3. This code reads blank cells 4. skip empty rows after reading complete. 5. it reads 5000 rows within 4 seconds.

Upvotes: 0

Craig
Craig

Reputation: 41

i found an answer. if you use the OpenXmlReader on the worksheet part you can iterate through and effectively lazy load the elements you come across.

OpenXmlReader oxr = OpenXmlReader.Create(worksheetPart); 

look for

ElementType == typeof(SheetData) 

and load the row (lazy)

Row row = (Row)oxr.LoadCurrentElement();

Upvotes: 4

D.C.
D.C.

Reputation: 15588

do the openxml libraries use dom or sax models? with dom you usually have to hold the entire document in memory at once, but with sax you can stream the events as they come.

Upvotes: 0

Related Questions