pedrodotnet
pedrodotnet

Reputation: 818

C# Process CSV file

I have the following CSV file with this header:

AccountOwnerEmail PartnerName EnrollmentID Customer LicensingProgram Country Culture Issue

with lines like this:

[email protected],"HEY"? Tester, 12345789,"Catalysis", LLC., Enterprise 6 TEST, etc,etc ,etc

I have a method to separate the lines into the corresponding columns:

var columns = columnsRegex.Matches(line)
                          .Cast<Match>()
                          .Select(m=> m.Value.Trim('\"', '\'', ' ', '\t'))
                          .ToList();

Here's the definition for columnsRegex:

private static Regex columnsRegex = new Regex("\"[^\"]*\"|'[^']*'|[^,;]+");

My problem here is that for example the PartnerName content is being separated into 3 columns like "" "Hey" and "?Tester"

I know that the CSV escapes the double quotes with another double quotes. And I have already checked another posts similar to this that recomends to add the reference to Microsoft.VisualBasic but that is not working for me. Its there any other approach to take to correctly process the CSV content?

Upvotes: 3

Views: 2748

Answers (2)

Julo
Julo

Reputation: 1120

EDIT: Added another parser method, fixed line and test parsing output.

I would say, that your regular expression pattern is wrong. It does not allow to use (doubled) " character in " prefixed values. The same problem is for '

internal static class Program
{
  private const string wrongLine = "[email protected],\"HEY\"? Tester, 12345789,\"Catalysis\", LLC., Enterprise 6 TEST, etc,etc ,etc";
  private const string fixedLine = "[email protected],\"\"\"HEY\"\"? Tester\", 12345789,\"Catalysis\", LLC., Enterprise 6 TEST, etc,etc ,etc";

  private static readonly Regex wrongPattern = new Regex("\"[^\"]*\"|'[^']*'|[^,;]+");
  private static readonly Regex fixedPattern = new Regex("((?:\"((?:[^\"]|\"\")*)\")|(?:'((?:[^']|'')*)')|([^,;]*))(?:[,;]|$)");

  private static void Main()
  {
    Console.WriteLine("***  Wrong line: ***");
    Console.WriteLine();
    Parse(wrongLine);

    Console.WriteLine();
    Console.WriteLine();
    Console.WriteLine("***  Fixed line: ***");
    Console.WriteLine();
    Parse(fixedLine);
  }

  private static void Parse(string line)
  {
    Console.WriteLine("--- [Original Regex] ---");

    var matches = wrongPattern.Matches(line);
    for (int i = 0; i < matches.Count; i++)
    {
      Console.WriteLine("'" + matches[i].Value + "'");
    }

    Console.WriteLine();
    Console.WriteLine("--- [Fixed Regex] ---");
    Console.WriteLine();

    matches = fixedPattern.Matches(line);
    for (int i = 0; i < matches.Count; i++)
    {
      Console.WriteLine("'" + GetValue(matches[i]) + "'");
    }

    Console.WriteLine();
    Console.WriteLine("--- [Correct(?) parser] ---");
    Console.WriteLine();

    var position = 0;
    while (position < line.Length)
    {
      var value = GetValue(line, ref position);
      Console.WriteLine("'" + value + "'");
    }
  }

  private static string GetValue(Match match)
  {
    if (!string.IsNullOrEmpty(match.Groups[2].Value))
    {
      return (match.Groups[2].Value.Replace("\"\"", "\""));
    }

    if (!string.IsNullOrEmpty(match.Groups[3].Value))
    {
      return (match.Groups[3].Value.Replace("''", "'"));
    }

    return (match.Groups[4].Value.Replace("''", "'"));
  }

  private static string GetValue(string line, ref int position)
  {
    string value;
    char? prefix;
    string endWith;
    switch (line[position])
    {
    case '\'':
    case '\"':
      prefix = line[position];
      endWith = prefix + ",";
      position++;
      break;

    default:
      prefix = null;
      endWith = ",";
      break;
    }

    var endPosition = line.IndexOf(endWith, position);
    if (endPosition < 0 && prefix.HasValue)
    {
      if (line[line.Length - 1] == prefix.Value)
      {
        value = line.Substring(position, line.Length - 1 - position);
        position = line.Length;
        return Fixprefix(value, prefix.Value.ToString());
      }

      position--;
      endPosition = line.IndexOf(',', position);
    }

    if (endPosition < 0)
    {
      value = line.Substring(position);
      position = line.Length;
      return value;
    }

    if (prefix.HasValue)
    {
      value = line.Substring(position, endPosition - position);
      position = endPosition + endWith.Length;
      return Fixprefix(value, prefix.Value.ToString());
    }

    value = line.Substring(position, endPosition - position);
    position = endPosition + endWith.Length;
    return value;
  }

  private static string Fixprefix(string value, string prefix) => value.Replace(prefix + prefix, prefix);
}

The 'fixed Regex pattern' still has a bug, but I leave it at current state...

(Figure your self how to fix this parsing.)

Parser test

Output window:

***  Wrong line: ***

--- [Original Regex] ---
'[email protected]'
'"HEY"'
'? Tester'
' 12345789'
'"Catalysis"'
' LLC.'
' Enterprise 6 TEST'
' etc'
'etc '
'etc'

--- [Fixed Regex] ---

'[email protected]'
'"HEY"? Tester'
' 12345789'
'Catalysis'
' LLC.'
' Enterprise 6 TEST'
' etc'
'etc '
'etc'
''

--- [Correct(?) parser] ---

'[email protected]'
'HEY"? Tester, 12345789,"Catalysis'
' LLC.'
' Enterprise 6 TEST'
' etc'
'etc '
'etc'


***  Fixed line: ***

--- [Original Regex] ---
'[email protected]'
'""'
'"HEY"'
'"? Tester"'
' 12345789'
'"Catalysis"'
' LLC.'
' Enterprise 6 TEST'
' etc'
'etc '
'etc'

--- [Fixed Regex] ---

'[email protected]'
'"HEY"? Tester'
' 12345789'
'Catalysis'
' LLC.'
' Enterprise 6 TEST'
' etc'
'etc '
'etc'
''

--- [Correct(?) parser] ---

'[email protected]'
'"HEY"? Tester'
' 12345789'
'Catalysis'
' LLC.'
' Enterprise 6 TEST'
' etc'
'etc '
'etc'

Upvotes: 1

Andrei
Andrei

Reputation: 44600

I use CsvHelper for it. It's a very nice library to parse CSV documents. Use nuget package:

Install-Package CsvHelper

Documentation can be found here.

var csv = new CsvReader( textReader );
var records = csv.GetRecords<MyCsvRecord>();

Where MyCsvRecord is your CSV row e.g.:

public class MyCsvRecord
{
    public string AccountOwnerEmail { get; set; }
    public string PartnerName { get; set; }
    // etc.
}

Upvotes: 3

Related Questions