Reputation: 803
I am working on a project in which i am reading few cells from an excel file using OpenXML SDK in VB.Net and storing it in DataTable. It works just fine for the medium sized and small files however when when i try to open a large file of size 107MB, i get an OutOfMemory exception after reading through few sheets. I am able to open the file by double clicking(it takes a while though) Below are the code that i am using. Please let me know if i can improve the process of reading by reducing the memory consumption
Dim CellRage As String() = {"AG65", "AG281", "AG335", "AG389", "AG443", "AG497", "AG551", "AG800", "AG913", "AG1081", "AG1165", "AG1305"}
Dim CellValue(13) As String
Using myWorkbook As SpreadsheetDocument = SpreadsheetDocument.Open(stream, False)
workbookPart = myWorkbook.WorkbookPart
For Each worksheetpart As WorksheetPart In workbookPart.WorksheetParts
For count As Integer = 0 To CellRage.GetUpperBound(0) -1
CellValue(count) = CellValue(workbookPart, sheetName, CostCellRage(count - 2))
Next
'After few sheets throws OutofMemory Exception
Next
End Using
Private Shared Function GetCellValue(workbookPart As WorkbookPart, sheetName As String, cellAddress As String) As String
Dim theCell As Cell
Dim wsPart As WorksheetPart
Dim worksheet As Sheet
Dim value As String
Dim stringTablePart As SharedStringTablePart = workbookPart.SharedStringTablePart
worksheet = workbookPart.Workbook.Descendants(Of Sheet).Where(Function(s) s.Name = sheetName).FirstOrDefault
wsPart = CType(workbookPart.GetPartById(worksheet.Id), WorksheetPart)
theCell = wsPart.Worksheet.Descendants(Of Cell).Where(Function(c) c.CellReference = cellAddress).FirstOrDefault
If theCell.ChildElements.Count = 0 Then
Return ""
End If
value = theCell.CellValue.Text
If (theCell.DataType IsNot Nothing) AndAlso (theCell.DataType.ToString() = "s") Then
value = stringTablePart.SharedStringTable.ChildElements(Int32.Parse(value)).InnerText
End If
Return value
End Function
Thanks for looking into this
Upvotes: 0
Views: 4255
Reputation: 12815
The problem you have is that your code is reading each sheet into memory which is eventually leading to you using too much memory.
As Jesper points out you can use a SAX approach rather than the DOM approach you are currently using. To read an Excel document using OpenXML with a SAX approach you can use the OpenXmlReader
class. This will read the file in a more memory efficient way by handling sections of the XML at a time rather than the whole DOM which should allow you to process large files.
The OpenXmlReader
reads the XML contents of the file part in XML chunks; think of it like reading a file using a Stream
. We can't jump to an arbitrary cell as we haven't read the whole document yet. Instead what we need to do is read each row and ignore the rows we don't want. Once we have the row we are interested in we have the entire row XML available so at this point we can jump directly to the cells within that row that we are interested in.
Dim desiredColumnNumber As Integer = 33
Dim cellRange As Integer() = New Integer() {65, 281, 335, 389, 443, 497, _
551, 800, 913, 1081, 1165, 1305}
Using reader As OpenXmlReader = OpenXmlReader.Create(worksheetPart)
While reader.Read()
'we want to find the first row
If reader.ElementType = GetType(Row) Then
Do
If Not cellRange.Contains(Convert.ToInt32(CUInt(row.RowIndex))) Then
'we're not interested in this row so continue
Continue Do
End If
Dim row As Row = DirectCast(reader.LoadCurrentElement(), Row)
If row.HasChildren Then
'get cell in column AG
Dim cell As Cell = DirectCast(row.ChildElements(desiredColumnNumber), Cell)
'do something with the cell...
End If
'move to the next row
Loop While reader.ReadNextSibling()
End If
End While
End Using
In the code above I have split your cell ranges into a cell number (desiredCellNumber
) and an array of Integer
(cellRange
) that stores the number of each row we are interested in (if you can't split this in your original code you'll have to convert each cell reference into the correct format).
After that we then check the RowIndex
property on each row to see if it exists in the cellRange
. If not then we move on to the next row but if it is a row we are interested we can access the cells by accessing the Row.ChildElements
property. In the code above we are only interested in column 33 so we read that cell value using row.ChildElements(desiredColumnNumber)
.
As an aside, when I'm dealing with large files like this I usually read the SharedStringTablePart
into a Dictionary
or similar first and then read from the there when I need to get a string value. Obviously the amount of memory this consumes depends heavily on the contents of your Excel file so you might want to do something cleverer to minimise the memory used when storing the Shared Strings. The code for reading the shared strings is (almost) identical to the above:
Dim sharedStrings As New Dictionary(Of Integer, String)()
If stringTablePart IsNot Nothing Then
Using reader As OpenXmlReader = OpenXmlReader.Create(stringTablePart)
Dim i As Integer = 0
While reader.Read()
If reader.ElementType = GetType(SharedStringItem) Then
Dim sharedStringItem As SharedStringItem = DirectCast(reader.LoadCurrentElement(), SharedStringItem)
sharedStrings.Add(i), If(sharedStringItem.Text IsNot Nothing, sharedStringItem.Text.Text, String.Empty))
i = i + 1
End If
End While
End Using
End If
Upvotes: 3
Reputation: 2013
Openxml SDK uses a 'DOM-approach' when loading files into memory.
For processing large amounts of data, you should consider using a SAX approach like the XmlReader.
Of course, XmlReader is not a part of the OpenXml SDK, but it should give you better performance and a much lower memory footprint.
Upvotes: 1