Reputation: 7143
Situation I need to solve:
My client has some extremely large .xlsx files that resemble a database table (each row is a record, columns are fields).
I need to help them process those files (search, filter, etc.).
By large I mean the smallest of them has 1 million records.
What I have tried:
My attempts so far end in a System.OverflowException; my guess is that it's basically out of memory, because a 200 MB .xlsx file already took 4 GB of memory just to read.
Due to confidentiality I cannot share the actual file, but you can easily create a similar structure with 60 columns (first name, last name, dob, etc.) and about 1M records.
The question is solved as soon as you can read an .xlsx file of that size, remove half of the records, and then write the result somewhere else without running into memory issues.
It would be great if there were a way to slowly read small chunks of data from the file, row by row, but the solutions I have found all read the entire file at once.
Upvotes: 4
Views: 4922
Reputation: 1958
Short Answer: Use any library that does not load the shared string table (sharedStrings.xml) into RAM. I have looked at many libraries, but all of them load the shared strings into RAM, so I developed one myself to solve this problem. Here are the Source Code and NuGet package.
Long Answer: The xlsx file format is basically a zip file that contains the sheets as xml files. As an optimization, an xlsx file stores each duplicate cell value once in sharedStrings.xml and has the cell refer to the index of that shared string entry. Most libraries load the whole shared string table into a List/Dictionary for lookups, which is why processing large xlsx files consumes a lot of RAM.
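You can see this structure yourself with nothing but the built-in System.IO.Compression classes; a minimal sketch (the file path is just a placeholder):

using System;
using System.IO.Compression;

// Open the workbook as a plain zip archive and list its parts; you will see
// entries such as xl/worksheets/sheet1.xml and xl/sharedStrings.xml.
using (var archive = ZipFile.OpenRead("C:\\temp\\input.xlsx"))
{
    foreach (var entry in archive.Entries)
    {
        Console.WriteLine($"{entry.FullName} ({entry.Length:N0} bytes)");
    }
}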
XlsxHelper takes a different approach. For small files it loads the shared strings into RAM so that lookups are very fast. For larger files it indexes the shared strings into a file on disk and resolves them through a FileStream read; as a result it uses far less RAM on large files, at the cost of being slightly slower.
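To make the idea concrete, here is a rough sketch of the streaming approach using plain ZipArchive and XmlReader (this is not XlsxHelper's actual API). The sheet xml is read forward-only, one cell at a time, and shared-string cells are resolved through a lookup; the lookup shown here is the simple in-memory variant, which the file-backed index would replace for huge files. The file path and the assumption that the first sheet is stored as xl/worksheets/sheet1.xml are placeholders.

using System.Collections.Generic;
using System.IO.Compression;
using System.Xml;
using System.Xml.Linq;

// Loads the shared string table into memory for fast lookups (fine for small files).
static List<string> LoadSharedStrings(ZipArchive archive)
{
    var strings = new List<string>();
    var entry = archive.GetEntry("xl/sharedStrings.xml");
    if (entry == null) return strings;

    using (var reader = XmlReader.Create(entry.Open()))
    {
        reader.MoveToContent();  // position on the <sst> root
        reader.Read();           // step inside it
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "si")
            {
                // XNode.ReadFrom consumes the whole <si> (including rich-text runs)
                // and leaves the reader on the node that follows it.
                strings.Add(((XElement)XNode.ReadFrom(reader)).Value);
            }
            else
            {
                reader.Read();
            }
        }
    }
    return strings;
}

using (var archive = ZipFile.OpenRead("C:\\temp\\input.xlsx"))
{
    var sharedStrings = LoadSharedStrings(archive);

    // Stream the worksheet forward-only instead of materialising it.
    // Real code would resolve the sheet path via xl/workbook.xml.
    using (var reader = XmlReader.Create(archive.GetEntry("xl/worksheets/sheet1.xml").Open()))
    {
        while (reader.Read())
        {
            // A cell is a <c> element; t="s" marks a shared-string reference.
            if (reader.NodeType != XmlNodeType.Element || reader.LocalName != "c") continue;

            var isSharedString = reader.GetAttribute("t") == "s";
            if (reader.ReadToDescendant("v"))
            {
                var raw = reader.ReadElementContentAsString();
                var value = isSharedString ? sharedStrings[int.Parse(raw)] : raw;
                // Filter / copy 'value' here, one cell at a time.
            }
        }
    }
}

The sketch ignores inline strings, dates and number formats; it is only meant to show why keeping the shared string lookup out of RAM is the part that matters for memory usage.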
Upvotes: 1
Reputation: 1685
For reading the Excel file I would recommend ExcelDataReader. It does very well with reading large files; I personally tried 500k-1M records:
using System.IO;
using ExcelDataReader;

// On .NET Core / .NET 5+, ExcelDataReader may also require:
// System.Text.Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

using (var stream = File.Open("C:\\temp\\input.xlsx", FileMode.Open, FileAccess.Read))
{
    using (var reader = ExcelReaderFactory.CreateReader(stream))
    {
        // Rows are streamed one at a time rather than loaded as a whole sheet.
        while (reader.Read())
        {
            for (var i = 0; i < reader.FieldCount; i++)
            {
                var value = reader.GetValue(i)?.ToString();
            }
        }
    }
}
Writing data back in the same efficient way is trickier. I ended up creating my own SwiftExcel library, which is extremely fast and efficient (there is a performance chart comparing it to other NuGet libraries, including EPPlus), as it does not use any XML serialization and writes data directly to the file:
using SwiftExcel;

using (var ew = new ExcelWriter("C:\\temp\\test.xlsx"))
{
    for (var row = 1; row <= 100; row++)
    {
        for (var col = 1; col <= 10; col++)
        {
            // Write(value, col, row) writes the cell straight to the output file.
            ew.Write($"row:{row}-col:{col}", col, row);
        }
    }
}
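Putting the two together for the original task, here is a rough sketch (file paths and the "keep every other record" filter are just placeholders) that streams rows in with ExcelDataReader and writes the kept rows out with SwiftExcel, so rows are processed one at a time:

using System.IO;
using ExcelDataReader;
using SwiftExcel;

using (var stream = File.Open("C:\\temp\\input.xlsx", FileMode.Open, FileAccess.Read))
using (var reader = ExcelReaderFactory.CreateReader(stream))
using (var ew = new ExcelWriter("C:\\temp\\output.xlsx"))
{
    var sourceRow = 0;
    var targetRow = 0;

    while (reader.Read())
    {
        sourceRow++;

        // Placeholder filter: keep the header row plus every other record.
        if (sourceRow > 1 && sourceRow % 2 == 0) continue;

        targetRow++;
        for (var i = 0; i < reader.FieldCount; i++)
        {
            ew.Write(reader.GetValue(i)?.ToString(), i + 1, targetRow);
        }
    }
}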
Upvotes: 7