RHarris

Reputation: 11207

How to read uploaded CSV UTF-8 for processing with CsvHelper?

My WebAPI allows a user to upload a CSV file and then parses the file. I use CsvHelper to do the heavy lifting of reading the CSV and mapping it to domain objects.

However, I have one customer whose files are in CSV UTF-8 format. The code that works for "vanilla" (ASCII) CSV files hurls when it tries to deal with CSV UTF-8.

Is there a way to import the CSV UTF-8 data and convert it to ASCII CSV so that my code will continue to work?

My current code looks like this:

//In my WebAPI Controller
//fileToProcess is IFormFile
byte[] fileBytes = new byte[fileToProcess.Length];
using(var stream = fileToProcess.OpenReadStream())
{
    await stream.ReadAsync(fileBytes);
    stream.Close();
}

var result = await ProcessFileAsync(fileBytes);

return Ok(result);
...

//In a Parsing Class
public async Task<List<Client>> ProcessFileAsync(byte[] fileBytes)
{
    List<Client> result = null;
    var fileText = Encoding.Default.GetString(fileBytes);
    using(var reader = new StringReader(fileText))
    {
       using(var csv = new CsvReader(reader))
       {
          csv.RegisterClassMap<ClientMap>();
          result = csv.GetRecords<Client>().ToList();
          await PostProcess(result);
       }
    }

    return result;
}

The problem is that CSV UTF-8 has the BOM so when CsvHelper tries to process a mapping that references the first column header

Map(c => c.ClientId).Name("CLIENT ID");

it fails because the column name includes the BOM.

So, my questions are:

  1. How can I tell if the incoming file is UTF-8 or ASCII?
  2. How do I convert the UTF-8 to ASCII so it can be processed normally?

NOTE

I did try the following:

fileBytes = Encoding.Convert(Encoding.UTF8, Encoding.ASCII, fileBytes);

However, this replaced the BOM with a ? which still causes CsvHelper to fail.

Upvotes: 1

Views: 6581

Answers (1)

madreflection

Reputation: 4957

By doing this:

var fileText = Encoding.Default.GetString(fileBytes);
using(var reader = new StringReader(fileText))

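... you're locking yourself into a specific encoding at the point of converting it to a string. Encoding.Default can vary by platform and CLR implementation. On top of that, GetString does not strip the BOM, so U+FEFF ends up as the first character of your first column header.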

The StreamReader class is designed to read text from a stream (which you can wrap around the raw bytes with a MemoryStream) and is capable of detecting the encoding for you if you let it. Try this instead:

using (var stream = new MemoryStream(fileBytes))
using (var reader = new StreamReader(stream))
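
To see the difference concretely, here is a minimal sketch that decodes the same bytes both ways. The BomDemo class, its method names, and the sample header are made up for illustration; only Encoding.UTF8, StringReader, MemoryStream, and StreamReader come from the framework.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical demo class, not part of CsvHelper or the original code.
public static class BomDemo
{
    // Builds a UTF-8 payload that starts with the BOM (bytes EF BB BF).
    public static byte[] BuildUtf8WithBom(string text)
    {
        byte[] preamble = Encoding.UTF8.GetPreamble();
        byte[] body = Encoding.UTF8.GetBytes(text);
        byte[] all = new byte[preamble.Length + body.Length];
        Buffer.BlockCopy(preamble, 0, all, 0, preamble.Length);
        Buffer.BlockCopy(body, 0, all, preamble.Length, body.Length);
        return all;
    }

    // Decodes up front, as in the question: the BOM survives as U+FEFF.
    public static string FirstLineViaGetString(byte[] bytes)
    {
        string text = Encoding.UTF8.GetString(bytes);
        using (var reader = new StringReader(text))
        {
            return reader.ReadLine();
        }
    }

    // Lets StreamReader decode: it detects the BOM and consumes it.
    public static string FirstLineViaStreamReader(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        using (var reader = new StreamReader(stream))
        {
            return reader.ReadLine();
        }
    }
}
```

With input built by BuildUtf8WithBom("CLIENT ID,NAME\n1,Acme"), the first method returns a header whose first character is U+FEFF, which is what trips up the Name("CLIENT ID") mapping, while the second returns the header without it.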

In your case, you could use the incoming stream directly by changing ProcessFileAsync to accept the stream.

using (var stream = fileToProcess.OpenReadStream())
{
    var result = await ProcessFileAsync(stream);

    return Ok(result);
}

public async Task<List<Client>> ProcessFileAsync(Stream stream)
{
    using (var reader = new StreamReader(stream))
    {
       using (var csv = new CsvReader(reader))
       {
           csv.RegisterClassMap<ClientMap>();
           List<Client> result = csv.GetRecords<Client>().ToList();
           await PostProcess(result);
           return result;
       }
    }
}

As long as the BOM is present, this will also support UTF-16 and UTF-32 encoded files, in either byte order, because StreamReader recognizes the U+FEFF code point in whichever encoding the file uses. Without a BOM it falls back to UTF-8, which also reads plain ASCII correctly.
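
As for telling which encoding came in: you usually don't need to, but if you want to log it, StreamReader exposes what it settled on through CurrentEncoding once it has read something. A rough sketch follows; EncodingProbe and DetectEncodingName are names I made up for the example.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper, shown only to illustrate CurrentEncoding.
public static class EncodingProbe
{
    // Returns the web name of the encoding StreamReader picked for these bytes.
    public static string DetectEncodingName(byte[] fileBytes)
    {
        using (var stream = new MemoryStream(fileBytes))
        using (var reader = new StreamReader(stream, Encoding.UTF8, detectEncodingFromByteOrderMarks: true))
        {
            // Detection only happens on the first read, so read one character.
            reader.Read();
            return reader.CurrentEncoding.WebName;
        }
    }
}
```

Note that a file without a BOM reports the UTF-8 fallback here, so this can't distinguish BOM-less UTF-8 from ASCII. That rarely matters, since every ASCII file is also valid UTF-8.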

Upvotes: 1
