Reputation: 141638
We have an Excel workbook that is fairly large. Around 3,300 columns and several thousand rows.
We've found that attempting to do anything with the data results in high memory usage, around 3 GB.
It appears that the DocumentFormat.OpenXml
package, when iterated, keeps a full object structure of the worksheet in memory. Generally, we are doing this:
var workbookPart = _document.WorkbookPart;
var worksheets = workbookPart.Workbook.Descendants<Sheet>();
foreach(var worksheet in worksheets)
{
var worksheetPart = (WorksheetPart) workbookPart.GetPartById(worksheet.Id);
foreach(var row in worksheetPart.Worksheet.Descendants<Row>())
{
foreach(var cell in row.Descendants<Cell>())
{
var (_, value) = ParseCell(cell);
}
}
}
ParseCell
simply gets the contents of the Cell
by looking up the string value from the SharedStringTable
on the workbook, or if it is a number, parsing the number.
Simply running this code that does nothing with the result of ParseCell
still uses considerable memory.
When we profiled this code, we noticed that there is a Cell
on the heap for every cell in the sheet despite our best attempts at using IEnumerable<T>
APIs to void having large collections in memory.
This is fairly close to the recommended usage of this Nuget package.
From profiling, it appears the issue is that each Cell
has a strong reference to the next Cell
, and likewise for Row
.
Each Cell
has a field called and _next
which is what is keeping each Cell with a strong root. Cell A has a strong reference to Cell B, B to C, C to D.
Row
has a similar structure, where Row 0 has a _next
field to Row 1, so on and so forth, so for each Row
that we go through, it keeps a strong reference to the next Row
.
So everything is tied together. When I looked at this with WinDbg after it processed that last Row
, there were exactly the number of Cell
s on the heap from !dumpheap -stat
as the workbook contained.
The way that we are using this SDK won't scale to more rows. Is there a way to use this package more efficiently and process an Worksheet on a row-by-row basis without it keeping an object graph of the whole worksheet in memory?
Upvotes: 3
Views: 961
Reputation: 141638
A suitable solution here is to use the OpenXmlReader
XML reader. The other key thing is to use Elements
instead of Decendents
to avoid looking too deep in the XML structure.
using (var reader = OpenXmlReader.Create(worksheetPart))
{
while (reader.Read())
{
if (typeof(Row).IsAssignableFrom(reader.ElementType))
{
var row = (Row)reader.LoadCurrentElement();
foreach (var cell in row.Elements<Cell>())
{
var (_, value) = ParseCell(cell);
}
}
}
}
This does indeed "stream" the elements and memory usage is minimal.
Upvotes: 1