Reputation: 10802
The documentation says that:
The following code segment is used to read a very large Excel
file using the DOM approach.
and then goes an example. I use it to implement reading a relatively large file with 700K rows. I have this code by now:
using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(path, false))
{
WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
SheetData sheetData = worksheetPart.Worksheet.Elements<SheetData>().First();
// no other code
}
When I start my program, I see how quickly - just in five seconds - it runs out of memory (>1G). And the debugger points to this line of code:
SheetData sheetData = worksheetPart.Worksheet.Elements<SheetData>().First();
So, I need to know whether OpenXML
really helps to read large files. And, if not, what are the alternatives (Interop does not help - I've already checked it).
EDIT
One extra mysterious thing. This code I get by now:
OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);
while (reader.Read())
{
if (reader.ElementType == typeof(Row))
{
count++;
}
}
gives me in the count
variable over than a million of rows. However, I do have 14K on the first sheet and 700K on the second sheet. It is very strange. So, my extra question is how to parse only rows with data using SAX
approach. And one final mystery of reading large Excel files on OpenXML. One guy in this thread says that: "Turns out that the worksheets are enumerated backwards for some reason (so the first of my three sheets is actually index 3". So, my final extra question is how to get the sheet you want. At this moment I use this code:
WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
But taking into account what the say, I'm not sure that in my case I would actually get the first worksheet.
Upvotes: 1
Views: 3001
Reputation: 12815
You seem to have a few questions, I'll try and tackle them one-by-one.
So, I need to know whether OpenXML really helps to read large files. And, if not, what are the alternatives (Interop does not help - I've already checked it).
Yes, the OpenXml SDK is great for reading large files but you may need to use a SAX approach rather than a DOM approach. From the same documentation you cite:
However, the DOM approach requires loading entire Open XML parts into memory, which can cause an Out of Memory exception when you are working with really large files.... Consider using SAX when you need to handle very large files.
The DOM approach loads the whole sheet into memory which for a large sheet can cause out of memory exceptions. Using the SAX approach you read each element in turn which reduces the memory consumption considerably.
So, my extra question is how to parse only rows with data using SAX approach
You are only getting the rows that have data (or at least the rows that exist in the XML) using the SDK. You appear to have asked this as a separate question which I've answered in more detail but essentially you are seeing the start and end of each row element using the code in your question. See my answer to your Why does OpenXML read rows twice question for more details.
So, my final extra question is how to get the sheet you want.
You need to find the Sheet
by name which is a descendant of the Workbook
. Once you have that you can use its Id
to get the WorksheetPart
:
using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(filename, false))
{
WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
Sheet sheet = workbookPart.Workbook.Descendants<Sheet>().Where(s => s.Name == sheetName).First();
if (sheet != null)
{
WorksheetPart worksheetPart = workbookPart.GetPartById(sheet.Id) as WorksheetPart;
//read worksheetPart...
}
}
Upvotes: 4