vcsjones
vcsjones

Reputation: 141638

Streaming through Excel data with Open XML SDK

We have an Excel workbook that is fairly large. Around 3,300 columns and several thousand rows.

We've found that attempting to do anything with the data results in high memory usage, around 3 GB.

It appears that the DocumentFormat.OpenXml package, when iterated, keeps a full object structure of the worksheet in memory. Generally, we are doing this:

var workbookPart = _document.WorkbookPart;
var worksheets = workbookPart.Workbook.Descendants<Sheet>();

foreach(var worksheet in worksheets)
{
    var worksheetPart = (WorksheetPart) workbookPart.GetPartById(worksheet.Id);
    foreach(var row in worksheetPart.Worksheet.Descendants<Row>())
    {
        foreach(var cell in row.Descendants<Cell>())
        {
            var (_, value) = ParseCell(cell);
        }
    }
}

ParseCell simply gets the contents of the Cell by looking up the string value from the SharedStringTable on the workbook, or if it is a number, parsing the number.

Simply running this code that does nothing with the result of ParseCell still uses considerable memory.

When we profiled this code, we noticed that there is a Cell on the heap for every cell in the sheet despite our best attempts at using IEnumerable<T> APIs to void having large collections in memory.

This is fairly close to the recommended usage of this Nuget package.

From profiling, it appears the issue is that each Cell has a strong reference to the next Cell, and likewise for Row.

Each Cell has a field called and _next which is what is keeping each Cell with a strong root. Cell A has a strong reference to Cell B, B to C, C to D.

Row has a similar structure, where Row 0 has a _next field to Row 1, so on and so forth, so for each Row that we go through, it keeps a strong reference to the next Row.

So everything is tied together. When I looked at this with WinDbg after it processed that last Row, there were exactly the number of Cells on the heap from !dumpheap -stat as the workbook contained.

The way that we are using this SDK won't scale to more rows. Is there a way to use this package more efficiently and process an Worksheet on a row-by-row basis without it keeping an object graph of the whole worksheet in memory?

Upvotes: 3

Views: 961

Answers (1)

vcsjones
vcsjones

Reputation: 141638

A suitable solution here is to use the OpenXmlReader XML reader. The other key thing is to use Elements instead of Decendents to avoid looking too deep in the XML structure.

using (var reader = OpenXmlReader.Create(worksheetPart))
{
    while (reader.Read())
    {
        if (typeof(Row).IsAssignableFrom(reader.ElementType))
        {
            var row = (Row)reader.LoadCurrentElement();
            foreach (var cell in row.Elements<Cell>())
            {
                var (_, value) = ParseCell(cell);
            }
        }
    }
}

This does indeed "stream" the elements and memory usage is minimal.

Upvotes: 1

Related Questions