Frank R. Haugen

Reputation: 240

How to efficiently download, read and process CSV in C#

I'm working on a service that will collect a large CSV file from an online resource, read the lines as the file downloads (preferably in batches), and send them to a database. It should never use more than 256MB of RAM and must not save the file to disk.

This is for a service that will run once every 7 days and collect all the companies in the Norwegian Company Register (a nifty 250MB, 1.1-million-line CSV is found here: http://hotell.difi.no/download/brreg/enhetsregisteret ).

My application can easily download the file, add it to a List<>, and process it, but it uses 3.3 GB of RAM.

public async Task<bool> CollectAndUpdateNorwegianCompanyRegistry()
{
    var request = await _httpClient.GetAsync(_options.Value.Urls["BrregCsv"]);

    request.EnsureSuccessStatusCode();

    using (var stream = await request.Content.ReadAsStreamAsync())
    using (var streamReader = new StreamReader(stream))
    {
        while (!streamReader.EndOfStream)
        {
            using (var csv = new CsvReader(streamReader)) // CsvReader is from the CsvHelper NuGet package
            {
                csv.Configuration.Delimiter = ";";
                csv.Configuration.BadDataFound = null;
                csv.Configuration.RegisterClassMap<NorwegianCompanyClassMap>();

                await _sqlRepository.UpdateNorwegianCompaniesTable(csv.GetRecords<NorwegianCompany>().ToList());
            }
        }
    }

    return true;
}

A small note on the SqlRepository: I've replaced it with a simple "destroyer" method that just clears the data, so as to not use any extra resources while debugging.

What I'd expect is that the garbage collector would free the resources as the lines of the file are processed, but it doesn't.

Put simply, I want the following to happen: as the CSV downloads, a few lines are read, sent off to a method, and then flushed from memory.
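To illustrate what I think is going on (based on CsvHelper's documentation; this is a sketch, not my actual code):

var lazy = csv.GetRecords<NorwegianCompany>();           // lazy IEnumerable<T>: yields one record at a time
var eager = csv.GetRecords<NorwegianCompany>().ToList(); // forces all ~1.1 million records into memory at once

So I suspect the .ToList() call in my code above is what defeats the streaming, but I'm not sure.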

I'm definitely inexperienced at working with large datasets, so I'm working off other people's examples and not getting the results I expect.

Thank you for your time and assistance.

Upvotes: 1

Views: 2053

Answers (2)

Frank R. Haugen

Reputation: 240

Another solution I'm now implementing, which is more predictable in its resource use, is this:

public async Task<bool> CollectAndUpdateNorwegianCompanyRegistryAlternate()
{
    using (var stream = await _httpClient.GetStreamAsync(_options.Value.Urls["BrregCsv"]))
    using (var reader = new StreamReader(stream))
    using (var csv = new CsvReader(reader))
    {
        csv.Configuration.RegisterClassMap<NorwegianCompanyClassMap>();
        csv.Configuration.Delimiter = ";";
        csv.Configuration.BadDataFound = null;

        var tempList = new List<NorwegianCompany>();

        while (csv.Read())
        {
            tempList.Add(csv.GetRecord<NorwegianCompany>());

            if (tempList.Count >= 50000)
            {
                // Await the repository call directly: wrapping it in
                // Task.Factory.StartNew returns a Task<Task>, so awaiting it
                // would resume (and clear the list) before the update finished
                await _sqlRepository.UpdateNorwegianCompaniesTable(tempList);

                tempList.Clear();
            }
        }

        // Flush the final partial batch, which never reaches the 50000 threshold
        if (tempList.Count > 0)
        {
            await _sqlRepository.UpdateNorwegianCompaniesTable(tempList);
        }
    }
    return true;
}

Now it takes 3 minutes, but never peaks above 200MB and uses 7-12% CPU, even when doing SQL bulk updates every X lines (the SqlBulkTool NuGet package is excellent for my needs here).
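For reference, if you don't want a package dependency, the same batch write can be done with the framework's own SqlBulkCopy. A rough sketch (the table name, columns, and NorwegianCompany properties here are placeholders, not my actual schema):

// Requires: using System.Data; using System.Data.SqlClient;
// "dbo.NorwegianCompanies" and the two columns are placeholders
public async Task WriteBatchAsync(string connectionString, IReadOnlyCollection<NorwegianCompany> batch)
{
    var table = new DataTable();
    table.Columns.Add("OrganizationNumber", typeof(string));
    table.Columns.Add("Name", typeof(string));

    foreach (var company in batch)
        table.Rows.Add(company.OrganizationNumber, company.Name);

    using (var connection = new SqlConnection(connectionString))
    {
        await connection.OpenAsync();

        using (var bulkCopy = new SqlBulkCopy(connection))
        {
            bulkCopy.DestinationTableName = "dbo.NorwegianCompanies";
            await bulkCopy.WriteToServerAsync(table);
        }
    }
}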

Upvotes: 0

Frank R. Haugen

Reputation: 240

So getting some pointers from Sami Kuhmonen (@sami-kuhmonen) helped, and here's what I came up with:

public async Task<bool> CollectAndUpdateNorwegianCompanyRegistry()
{
    using (var stream = await _httpClient.GetStreamAsync(_options.Value.Urls["BrregCsv"]))
    using (var streamReader = new StreamReader(stream))
    using (var csv = new CsvReader(streamReader))
    {
        csv.Configuration.Delimiter = ";";
        csv.Configuration.BadDataFound = null;
        csv.Configuration.RegisterClassMap<NorwegianCompanyClassMap>();

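        // GetRecords<T>() returns a lazy IEnumerable, so records stream to the
        // repository one at a time instead of being materialized up front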
        await _sqlRepository.UpdateNorwegianCompaniesTable(csv.GetRecords<NorwegianCompany>());
    }

    return true;
}

It downloads the entire file and sends it to the SqlRepository in 20 seconds, never surpassing 15% CPU or 30MB of RAM.
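For completeness: this only stays that low because UpdateNorwegianCompaniesTable consumes the IEnumerable without materializing it. My actual repository isn't shown here, but it's shaped roughly like this sketch (WriteBatchAsync is a placeholder for whatever bulk write you use):

// Assumed shape of the repository method: enumerate the lazy sequence and
// write it in fixed-size batches; WriteBatchAsync stands in for the actual
// bulk write (SqlBulkTool, SqlBulkCopy, ...)
public async Task UpdateNorwegianCompaniesTable(IEnumerable<NorwegianCompany> companies)
{
    var batch = new List<NorwegianCompany>(50000);

    foreach (var company in companies)
    {
        batch.Add(company);

        if (batch.Count == 50000)
        {
            await WriteBatchAsync(batch);
            batch.Clear();
        }
    }

    if (batch.Count > 0) // final partial batch
        await WriteBatchAsync(batch);
}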

Now my next challenge is the SqlRepository, but this issue is solved.

Upvotes: 3
