OpenXML does not help to read large Excel files contrary to documentation

Question

The following code segment is used to read a very large Excel 
file using the DOM approach.

and then goes an example. I use it to implement reading a relatively large file with 700K rows. I have this code by now:

using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(path, false)) 
{
    WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
    WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
    SheetData sheetData = worksheetPart.Worksheet.Elements().First();
    // no other code
}

When I start my program, I see how quickly - just in five seconds - it runs out of memory (>1G). And the debugger points to this line of code:

SheetData sheetData = worksheetPart.Worksheet.Elements().First();

So, I need to know whether OpenXML really helps to read large files. And, if not, what are the alternatives (Interop does not help - I've already checked it).

EDIT

One extra mysterious thing. This code I get by now:

OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);
while (reader.Read())
{
    if (reader.ElementType == typeof(Row))
    {
        count++;
    }                                
}

gives me in the count variable over than a million of rows. However, I do have 14K on the first sheet and 700K on the second sheet. It is very strange. So, my extra question is how to parse only rows with data using SAX approach. And one final mystery of reading large Excel files on OpenXML. One guy in this thread says that: "Turns out that the worksheets are enumerated backwards for some reason (so the first of my three sheets is actually index 3". So, my final extra question is how to get the sheet you want. At this moment I use this code:

WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
WorksheetPart worksheetPart =     workbookPart.WorksheetParts.First();

But taking into account what the say, I'm not sure that in my case I would actually get the first worksheet.

petelids · Accepted Answer

You seem to have a few questions, I'll try and tackle them one-by-one.

So, I need to know whether OpenXML really helps to read large files. And, if not, what are the alternatives (Interop does not help - I've already checked it).

Yes, the OpenXml SDK is great for reading large files but you may need to use a SAX approach rather than a DOM approach. From the same documentation you cite:

However, the DOM approach requires loading entire Open XML parts into memory, which can cause an Out of Memory exception when you are working with really large files.... Consider using SAX when you need to handle very large files.

The DOM approach loads the whole sheet into memory which for a large sheet can cause out of memory exceptions. Using the SAX approach you read each element in turn which reduces the memory consumption considerably.

So, my extra question is how to parse only rows with data using SAX approach

You are only getting the rows that have data (or at least the rows that exist in the XML) using the SDK. You appear to have asked this as a separate question which I've answered in more detail but essentially you are seeing the start and end of each row element using the code in your question. See my answer to your Why does OpenXML read rows twice question for more details.

So, my final extra question is how to get the sheet you want.

You need to find the Sheet by name which is a descendant of the Workbook. Once you have that you can use its Id to get the WorksheetPart:

using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(filename, false))
{
    WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;

    Sheet sheet = workbookPart.Workbook.Descendants().Where(s => s.Name == sheetName).First();
    if (sheet != null)
    {
        WorksheetPart worksheetPart = workbookPart.GetPartById(sheet.Id) as WorksheetPart;

        //read worksheetPart...
    }
}

OpenXML does not help to read large Excel files contrary to documentation

Answers (1)

Related Questions