Reputation: 7143
Situation I need to solve:
My client has some extremely large .xlsx files that resemble a database table (each row is a record, columns are fields).
I need to help them process those files (search, filter, etc.).
By large I mean the smallest of them has 1 million records.
What I have tried:
My attempts so far end in a System.OverflowException; my guess is that it's basically out of memory, because a 200 MB .xlsx file already took 4 GB of memory just to read.
Due to confidentiality I cannot share the actual file, but you can easily create a similar structure with 60 columns (first name, last name, dob, etc.) and about 1M records.
The question is solved as soon as you can read an .xlsx file of that size, remove half of the records, and then write the result somewhere else without running into memory issues.
It would be great if there were a way to slowly read small chunks of data from the file, row by row, but the solutions I have found all read the entire file at once.
Upvotes: 4
Views: 4922
Reputation: 1958
Short Answer: Use any library that does not load the shared string table (sharedStrings.xml) into RAM. I have looked at many libraries, but all of them load the shared strings into RAM, so I developed one myself to solve this problem. Here are the Source Code and NuGet package.
Long Answer: The xlsx file format is basically a zip file that contains the sheets as xml files. As an optimization, an xlsx file stores each duplicate cell value once in sharedStrings.xml and has the cell refer to the index of that shared string entry. Most libraries load the whole shared string table into a List/Dictionary for lookups, which is why processing large xlsx files consumes a lot of RAM.
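You can see this structure yourself with nothing but the built-in System.IO.Compression classes; a minimal sketch (the file path is just a placeholder):

using System;
using System.IO.Compression;

// Open the workbook as a plain zip archive and list its parts; you will see
// entries such as xl/worksheets/sheet1.xml and xl/sharedStrings.xml.
using (var archive = ZipFile.OpenRead("C:\\temp\\input.xlsx"))
{
    foreach (var entry in archive.Entries)
    {
        Console.WriteLine($"{entry.FullName} ({entry.Length:N0} bytes)");
    }
}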
XlsxHelper takes a different approach. For small files it loads the shared strings into RAM so that lookups are very fast. For larger files it indexes the shared strings into a file on disk and resolves them through a FileStream read; as a result it uses far less RAM on large files, at the cost of being slightly slower.
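To make the idea concrete, here is a rough sketch of the streaming approach using plain ZipArchive and XmlReader (this is not XlsxHelper's actual API). The sheet xml is read forward-only, one cell at a time, and shared-string cells are resolved through a lookup; the lookup shown here is the simple in-memory variant, which the file-backed index would replace for huge files. The file path and the assumption that the first sheet is stored as xl/worksheets/sheet1.xml are placeholders.

using System.Collections.Generic;
using System.IO.Compression;
using System.Xml;
using System.Xml.Linq;

// Loads the shared string table into memory for fast lookups (fine for small files).
static List<string> LoadSharedStrings(ZipArchive archive)
{
    var strings = new List<string>();
    var entry = archive.GetEntry("xl/sharedStrings.xml");
    if (entry == null) return strings;

    using (var reader = XmlReader.Create(entry.Open()))
    {
        reader.MoveToContent();  // position on the <sst> root
        reader.Read();           // step inside it
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "si")
            {
                // XNode.ReadFrom consumes the whole <si> (including rich-text runs)
                // and leaves the reader on the node that follows it.
                strings.Add(((XElement)XNode.ReadFrom(reader)).Value);
            }
            else
            {
                reader.Read();
            }
        }
    }
    return strings;
}

using (var archive = ZipFile.OpenRead("C:\\temp\\input.xlsx"))
{
    var sharedStrings = LoadSharedStrings(archive);

    // Stream the worksheet forward-only instead of materialising it.
    // Real code would resolve the sheet path via xl/workbook.xml.
    using (var reader = XmlReader.Create(archive.GetEntry("xl/worksheets/sheet1.xml").Open()))
    {
        while (reader.Read())
        {
            // A cell is a <c> element; t="s" marks a shared-string reference.
            if (reader.NodeType != XmlNodeType.Element || reader.LocalName != "c") continue;

            var isSharedString = reader.GetAttribute("t") == "s";
            if (reader.ReadToDescendant("v"))
            {
                var raw = reader.ReadElementContentAsString();
                var value = isSharedString ? sharedStrings[int.Parse(raw)] : raw;
                // Filter / copy 'value' here, one cell at a time.
            }
        }
    }
}

The sketch ignores inline strings, dates and number formats; it is only meant to show why keeping the shared string lookup out of RAM is the part that matters for memory usage.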
Upvotes: 1
Reputation: 1685
For reading the Excel file I would recommend ExcelDataReader. It does very well with reading large files; I personally tried 500k-1M records:
using System.IO;
using ExcelDataReader;

// On .NET Core / .NET 5+, ExcelDataReader may also require:
// System.Text.Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

using (var stream = File.Open("C:\\temp\\input.xlsx", FileMode.Open, FileAccess.Read))
{
    using (var reader = ExcelReaderFactory.CreateReader(stream))
    {
        // Rows are streamed one at a time rather than loaded as a whole sheet.
        while (reader.Read())
        {
            for (var i = 0; i < reader.FieldCount; i++)
            {
                var value = reader.GetValue(i)?.ToString();
            }
        }
    }
}
Writing data back in the same efficient way is trickier. I ended up creating my own SwiftExcel library, which is extremely fast and efficient (there is a performance chart comparing it to other NuGet libraries, including EPPlus), as it does not use any XML serialization and writes data directly to the file:
using SwiftExcel;

using (var ew = new ExcelWriter("C:\\temp\\test.xlsx"))
{
    for (var row = 1; row <= 100; row++)
    {
        for (var col = 1; col <= 10; col++)
        {
            // Write(value, col, row) writes the cell straight to the output file.
            ew.Write($"row:{row}-col:{col}", col, row);
        }
    }
}
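Putting the two together for the original task, here is a rough sketch (file paths and the "keep every other record" filter are just placeholders) that streams rows in with ExcelDataReader and writes the kept rows out with SwiftExcel, so rows are processed one at a time:

using System.IO;
using ExcelDataReader;
using SwiftExcel;

using (var stream = File.Open("C:\\temp\\input.xlsx", FileMode.Open, FileAccess.Read))
using (var reader = ExcelReaderFactory.CreateReader(stream))
using (var ew = new ExcelWriter("C:\\temp\\output.xlsx"))
{
    var sourceRow = 0;
    var targetRow = 0;

    while (reader.Read())
    {
        sourceRow++;

        // Placeholder filter: keep the header row plus every other record.
        if (sourceRow > 1 && sourceRow % 2 == 0) continue;

        targetRow++;
        for (var i = 0; i < reader.FieldCount; i++)
        {
            ew.Write(reader.GetValue(i)?.ToString(), i + 1, targetRow);
        }
    }
}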
Upvotes: 7