user3407675

Reputation: 117

C++ trying to read in malformed CSV with erroneous commas

I am trying to make a simple CSV file parser to transfer a large number of orders from an order system to an invoicing system. The issue is that the CSV I am downloading has erroneous commas, which sometimes appear in the name field, and this throws the whole process off.

The company INSISTS, which is really starting to piss me off, that they are simply copying data they receive into the CSV and so it's valid data.

Excel mostly seems to interpret this correctly, or at least puts the data in the right field; my program, however, doesn't. I opened the CSV in Notepad++ and there are no quotes around strings, just raw strings separated by commas.

This is currently how I am reading the file.

  #include <fstream>
  #include <iostream>
  #include <sstream>
  #include <string>
  #include <utility>
  #include <vector>
  using namespace std;

  #define vstring vector<string>   // my shorthand macro for vector<string>

  vstring explode(string const & s, char delim);   // defined below

  int main()
  {
    string t;
    getline(cin, t);
    string Output;
    string path = "in.csv";
    ifstream input(path);
    vstring readout;
    vstring contact, InvoiceNumber, InvoiceDate, DueDate, Description, Quantity, UnitAmount, AccountCode, TaxType, Currency, Allocator, test, Backup, AllocatorBackup;
    vector<int> read, add, total;
    if (input.is_open()) {
        for (string line; getline(input, line); ) {
            auto arr = explode(line, ',');
            contact.push_back(arr[7]); // Source site is the customer in this instance.
            InvoiceNumber.push_back(arr[0]); // OrderID will be the invoice number
            InvoiceDate.push_back(arr[1]); // Purchase date
            DueDate.push_back(arr[1]); // Same as the order date
            Description.push_back(arr[0]);
            Quantity.push_back(arr[0]);
            UnitAmount.push_back(arr[10]); // The total
            AccountCode.push_back(arr[7]); // Will be set depending on other factors - but contains the site of purchase
            Currency.push_back(arr[11]); // EUR/GBP
            Allocator.push_back(arr[6]); // This will decide the VAT treatment normally.
            AllocatorBackup.push_back(arr[5]); // This will decide the VAT treatment if the column is off by one.
            Backup.push_back(arr[12]);
            TaxType = Currency;
        }
    }
    return 0;
  }

  // Splits a line on the delimiter. Note this has no notion of quoting,
  // so a comma inside a field always starts a new token.
  vstring explode(string const & s, char delim) {
    vstring result;
    istringstream q(s);
    for (string token; getline(q, token, delim); ) {
        result.push_back(move(token));
    }
    return result;
  }

vstring is a macro I created to save myself typing vector<string> so often, so it's the same thing.

The issue is that when I come across one of the fields with a comma in it (normally the name field, which is [3]), it of course pushes everything back by one, so the account code becomes [8], etc. This is extremely troublesome, as in some cases it's difficult to tell whether or not I am dealing with correct data in the next field.

So two questions:

1) Is there any simple way in which I could detect this anomaly and correct for it that I've missed? I do of course try to check in my loop, where I can, whether valid data is where it's expected to be, but this is becoming messy and does not cope with more than one comma.

2) Is the company correct in telling me that it's "expected behaviour" to allow commas entered by a customer to creep into this CSV without being processed, or have they completely misunderstood the CSV "standard"?

Upvotes: 1

Views: 227

Answers (2)

DoMakeSayThink

Reputation: 165

Retired Ninja mentioned in the comments that one option would be to parse all fields on either side of the 'problem field' first, and then put the remaining data into the problem field. This is the best approach if you know which field might contain the corruption. If you don't know which field could be corrupted, you still have options, though!
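
Something along these lines could work as a repair step right after your explode() call (just a sketch, not tested against your data; the expected column count of 13 and the problem-field index of 3 are assumptions based on the indices in your question):

  #include <string>
  #include <vector>
  using namespace std;

  // Keep the columns to the left and right of the problem field as-is and
  // glue whatever is left over back into that field.
  // Assumes 13 columns are expected and only column 3 (the name) can contain commas.
  vector<string> repair_row(vector<string> fields,
                            size_t expected = 13, size_t problem = 3) {
      if (fields.size() <= expected) return fields;            // nothing to fix
      size_t extra = fields.size() - expected;                 // number of stray commas
      for (size_t i = 1; i <= extra; ++i)
          fields[problem] += "," + fields[problem + i];        // re-join the split name
      fields.erase(fields.begin() + problem + 1,
                   fields.begin() + problem + 1 + extra);      // drop the absorbed pieces
      return fields;
  }

You would call it right after explode, e.g. arr = repair_row(arr);, and then index arr exactly as before.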

You know:

  1. The number of fields that should be present

  2. Something about the type of data in each of those fields.

If you codify the types of the fields (implement classes for different data types, so your vectors of strings would become vectors of OrderIDs or Dates or Counts or....), you can test different concatenations (joining adjacent fields that are separated by a comma) and score them according to how many of the fields pass some data validation. You then choose the best scoring interpretation of the data. This would build some data validation into the process, and make everything a bit more robust.
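
A rough sketch of that scoring idea (the validators here, looks_numeric and any_text, are made-up placeholders rather than anything from the question; replace them with whatever you actually know about each column):

  #include <cctype>
  #include <functional>
  #include <string>
  #include <vector>
  using namespace std;

  // Example validators - stand-ins for real OrderID / date / amount checks.
  bool looks_numeric(const string& s) {
      if (s.empty()) return false;
      for (char c : s)
          if (!isdigit(static_cast<unsigned char>(c)) && c != '.') return false;
      return true;
  }
  bool any_text(const string&) { return true; }   // free-text columns such as the name

  // Merge 'extra' following fields into the field at 'pos', restoring the commas.
  vector<string> merge_at(const vector<string>& row, size_t pos, size_t extra) {
      vector<string> out(row.begin(), row.begin() + pos + 1);
      for (size_t i = 1; i <= extra; ++i) out.back() += "," + row[pos + i];
      out.insert(out.end(), row.begin() + pos + extra + 1, row.end());
      return out;
  }

  // Try every merge position and keep the candidate that passes the most validators.
  vector<string> best_interpretation(const vector<string>& row,
                                     const vector<function<bool(const string&)>>& validators) {
      size_t expected = validators.size();
      if (row.size() <= expected) return row;                  // nothing to repair
      size_t extra = row.size() - expected;

      vector<string> best = row;
      int best_score = -1;
      for (size_t pos = 0; pos + extra < row.size(); ++pos) {
          vector<string> candidate = merge_at(row, pos, extra);
          int score = 0;
          for (size_t i = 0; i < expected; ++i)
              if (validators[i](candidate[i])) ++score;
          if (score > best_score) { best_score = score; best = candidate; }
      }
      return best;
  }

For the layout in the question that would be a vector of thirteen validators, with any_text at the name position, and the best-scoring candidate is the row you go on to index.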

Upvotes: 1

Earinor

Reputation: 35

CSV is not that well defined. There is the standard way, where ',' separates the columns and '\n' the rows. Sometimes '"' is used to handle these symbols inside a field, but Excel includes the quotes only if a control character is involved.

Here is the definition from Wikipedia.

RFC 4180 formalized CSV. It defines the MIME type "text/csv", and CSV files that follow its rules should be very widely portable. Among its requirements:

-MS-DOS-style lines that end with (CR/LF) characters (optional for the last line).

-An optional header record (there is no sure way to detect whether it is present, so care is required when importing).

-Each record "should" contain the same number of comma-separated fields.

-Any field may be quoted (with double quotes).

-Fields containing a line-break, double-quote or commas should be quoted. (If they are not, the file will likely be impossible to process correctly).

-A (double) quote character in a field must be represented by two (double) quote characters.
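
For what it's worth, a splitter that follows these quoting rules is only slightly longer than a plain getline loop. A minimal sketch (it does not handle line breaks inside quoted fields, which the file here apparently doesn't contain):

  #include <string>
  #include <vector>
  using namespace std;

  // Quote-aware splitter following the rules above: a comma inside double quotes
  // does not end the field, and "" inside a quoted field becomes a literal quote.
  vector<string> split_csv_line(const string& line, char delim = ',') {
      vector<string> fields;
      string field;
      bool in_quotes = false;
      for (size_t i = 0; i < line.size(); ++i) {
          char c = line[i];
          if (in_quotes) {
              if (c == '"') {
                  if (i + 1 < line.size() && line[i + 1] == '"') { field += '"'; ++i; }  // escaped quote
                  else in_quotes = false;                                                // closing quote
              } else {
                  field += c;
              }
          } else if (c == '"') {
              in_quotes = true;              // opening quote
          } else if (c == delim) {
              fields.push_back(field);       // unquoted delimiter ends the field
              field.clear();
          } else {
              field += c;
          }
      }
      fields.push_back(field);               // last field
      return fields;
  }

With something like that in place, a quoted name containing commas comes back as a single field.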

Keep in mind that Excel has different settings depending on the system and system language. It might be that their Excel parses the file correctly, but somewhere else it isn't parsed the same way.

For example, in countries like Germany ';' is used to separate the columns. The decimal separators differ as well.

1.5 << english

1,5 << german

The same goes for the thousands separator.

1,000,000 << english

1.000.000 << german

or

1 000 000 << also german
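
To see the effect in code: the same text parses to different numbers depending on the imbued locale (a small sketch; locale names are platform dependent and std::locale throws if the named locale isn't installed):

  #include <iostream>
  #include <locale>
  #include <sstream>
  using namespace std;

  int main() {
      double value = 0.0;

      istringstream english("1.5");
      english.imbue(locale("C"));            // '.' is the decimal separator
      english >> value;
      cout << value << '\n';                 // prints 1.5

      istringstream german("1,5");
      german.imbue(locale("de_DE.UTF-8"));   // ',' is the decimal separator; name varies per OS
      german >> value;
      cout << value << '\n';                 // also prints 1.5
      return 0;
  }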

Now, Excel also has different CSV export settings, like .csv (Separated values), .csv (Macintosh) and .csv (MS-DOS), so I guess there can be differences there too.

Now for your questions: in my opinion they are not clearly wrong in what they are doing with their files. But you should think about discussing an (E)BNF with them. Here are some links:

BNF EBNF

It is a grammar that you decide on, and with clear definitions the code should be no problem. I know customers can block something like this because they don't want the extra work, but it is simply the best solution. If you want '"' in your file, they should provide it somehow. I don't know how they copy their data, but it should also be some kind of program (I don't think they do this by hand?), so your code and their code should use the same (E)BNF, which you decide on together with them.

Upvotes: 1
