user3407675

Reputation: 117

C++ trying to read in malformed CSV with erroneous commas

I am trying to make a simple CSV file parser to transfer a large number of orders from an order system to an invoicing system. The issue is that the CSV I am downloading has erroneous commas, which sometimes appear in the name field, and this throws the whole process off.

The company INSISTS, which is really starting to piss me off, that they are simply copying data they receive into the CSV and so it's valid data.

Excel mostly seems to interpret this correctly, or at least puts the data in the right field; my program, however, doesn't. I opened the CSV in Notepad++ and there are no quotes around strings, just raw strings separated by commas.

This is currently how I am reading the file.

  #include <fstream>
  #include <iostream>
  #include <sstream>
  #include <string>
  #include <utility>
  #include <vector>
  using namespace std;

  #define vstring vector<string>   // my shorthand macro for vector<string>

  vstring explode(string const & s, char delim);   // defined below

  int main()
  {
    string t;
    getline(cin, t);
    string Output;
    string path = "in.csv";
    ifstream input(path);
    vstring readout;
    vstring contact, InvoiceNumber, InvoiceDate, DueDate, Description, Quantity, UnitAmount, AccountCode, TaxType, Currency, Allocator, test, Backup, AllocatorBackup;
    vector<int> read, add, total;
    if (input.is_open()) {
        for (string line; getline(input, line); ) {
            auto arr = explode(line, ',');
            contact.push_back(arr[7]); // Source site is the customer in this instance.
            InvoiceNumber.push_back(arr[0]); // OrderID will be the invoice number
            InvoiceDate.push_back(arr[1]); // Purchase date
            DueDate.push_back(arr[1]); // Same as the order date
            Description.push_back(arr[0]);
            Quantity.push_back(arr[0]);
            UnitAmount.push_back(arr[10]); // The total
            AccountCode.push_back(arr[7]); // Will be set depending on other factors - but contains the site of purchase
            Currency.push_back(arr[11]); // EUR/GBP
            Allocator.push_back(arr[6]); // This will decide the VAT treatment normally.
            AllocatorBackup.push_back(arr[5]); // This will decide the VAT treatment if the column is off by one.
            Backup.push_back(arr[12]);
            TaxType = Currency;
        }
    }
    return 0;
  }

  // Splits a line on the delimiter. Note this has no notion of quoting,
  // so a comma inside a field always starts a new token.
  vstring explode(string const & s, char delim) {
    vstring result;
    istringstream q(s);
    for (string token; getline(q, token, delim); ) {
        result.push_back(move(token));
    }
    return result;
  }

vstring is a macro I created to save myself typing vector<string> so often, so it's the same thing.

The issue is that when I come across one of the fields with a comma in it (normally the name field, which is [3]), it of course pushes everything back by one, so the account code becomes [8], etc. This is extremely troublesome, as in some cases it's difficult to tell whether or not I am dealing with correct data in the next field.

So two questions:

1) Is there any simple way in which I could detect this anomaly and correct for it that I've missed? I do of course try to check in my loop, where I can, whether valid data is where it's expected to be, but this is becoming messy and does not cope with more than one comma.

2) Is the company correct in telling me that it's "expected behaviour" to allow commas entered by a customer to creep into this CSV without being processed, or have they completely misunderstood the CSV "standard"?

Upvotes: 1

Views: 227

Answers (2)

DoMakeSayThink

Reputation: 165

Retired Ninja mentioned in the comments that one option would be to parse all fields on either side of the 'problem field' first, and then put the remaining data into the problem field. This is the best approach if you know which field might contain the corruption. If you don't know which field could be corrupted, you still have options, though!
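
Something along these lines could work as a repair step right after your explode() call (just a sketch, not tested against your data; the expected column count of 13 and the problem-field index of 3 are assumptions based on the indices in your question):

  #include <string>
  #include <vector>
  using namespace std;

  // Keep the columns to the left and right of the problem field as-is and
  // glue whatever is left over back into that field.
  // Assumes 13 columns are expected and only column 3 (the name) can contain commas.
  vector<string> repair_row(vector<string> fields,
                            size_t expected = 13, size_t problem = 3) {
      if (fields.size() <= expected) return fields;            // nothing to fix
      size_t extra = fields.size() - expected;                 // number of stray commas
      for (size_t i = 1; i <= extra; ++i)
          fields[problem] += "," + fields[problem + i];        // re-join the split name
      fields.erase(fields.begin() + problem + 1,
                   fields.begin() + problem + 1 + extra);      // drop the absorbed pieces
      return fields;
  }

You would call it right after explode, e.g. arr = repair_row(arr);, and then index arr exactly as before.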

You know:

  1. The number of fields that should be present

  2. Something about the type of data in each of those fields.

If you codify the types of the fields (implement classes for different data types, so your vectors of strings would become vectors of OrderIDs or Dates or Counts or....), you can test different concatenations (joining adjacent fields that are separated by a comma) and score them according to how many of the fields pass some data validation. You then choose the best scoring interpretation of the data. This would build some data validation into the process, and make everything a bit more robust.
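
A rough sketch of that scoring idea (the validators here, looks_numeric and any_text, are made-up placeholders rather than anything from the question; replace them with whatever you actually know about each column):

  #include <cctype>
  #include <functional>
  #include <string>
  #include <vector>
  using namespace std;

  // Example validators - stand-ins for real OrderID / date / amount checks.
  bool looks_numeric(const string& s) {
      if (s.empty()) return false;
      for (char c : s)
          if (!isdigit(static_cast<unsigned char>(c)) && c != '.') return false;
      return true;
  }
  bool any_text(const string&) { return true; }   // free-text columns such as the name

  // Merge 'extra' following fields into the field at 'pos', restoring the commas.
  vector<string> merge_at(const vector<string>& row, size_t pos, size_t extra) {
      vector<string> out(row.begin(), row.begin() + pos + 1);
      for (size_t i = 1; i <= extra; ++i) out.back() += "," + row[pos + i];
      out.insert(out.end(), row.begin() + pos + extra + 1, row.end());
      return out;
  }

  // Try every merge position and keep the candidate that passes the most validators.
  vector<string> best_interpretation(const vector<string>& row,
                                     const vector<function<bool(const string&)>>& validators) {
      size_t expected = validators.size();
      if (row.size() <= expected) return row;                  // nothing to repair
      size_t extra = row.size() - expected;

      vector<string> best = row;
      int best_score = -1;
      for (size_t pos = 0; pos + extra < row.size(); ++pos) {
          vector<string> candidate = merge_at(row, pos, extra);
          int score = 0;
          for (size_t i = 0; i < expected; ++i)
              if (validators[i](candidate[i])) ++score;
          if (score > best_score) { best_score = score; best = candidate; }
      }
      return best;
  }

For the layout in the question that would be a vector of thirteen validators, with any_text at the name position, and the best-scoring candidate is the row you go on to index.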

Upvotes: 1

Earinor

Reputation: 35

CSV is not that well defined. There is the standard way, where ',' separates the columns and '\n' the rows. Sometimes '"' is used to handle these symbols inside a field, but Excel includes the quotes only if a control character is involved.

Here is the definition from Wikipedia.

RFC 4180 formalized CSV. It defines the MIME type "text/csv", and CSV files that follow its rules should be very widely portable. Among its requirements:

-MS-DOS-style lines that end with (CR/LF) characters (optional for the last line).

-An optional header record (there is no sure way to detect whether it is present, so care is required when importing).

-Each record "should" contain the same number of comma-separated fields.

-Any field may be quoted (with double quotes).

-Fields containing a line-break, double-quote or commas should be quoted. (If they are not, the file will likely be impossible to process correctly).

-A (double) quote character in a field must be represented by two (double) quote characters.
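
For what it's worth, a splitter that follows these quoting rules is only slightly longer than a plain getline loop. A minimal sketch (it does not handle line breaks inside quoted fields, which the file here apparently doesn't contain):

  #include <string>
  #include <vector>
  using namespace std;

  // Quote-aware splitter following the rules above: a comma inside double quotes
  // does not end the field, and "" inside a quoted field becomes a literal quote.
  vector<string> split_csv_line(const string& line, char delim = ',') {
      vector<string> fields;
      string field;
      bool in_quotes = false;
      for (size_t i = 0; i < line.size(); ++i) {
          char c = line[i];
          if (in_quotes) {
              if (c == '"') {
                  if (i + 1 < line.size() && line[i + 1] == '"') { field += '"'; ++i; }  // escaped quote
                  else in_quotes = false;                                                // closing quote
              } else {
                  field += c;
              }
          } else if (c == '"') {
              in_quotes = true;              // opening quote
          } else if (c == delim) {
              fields.push_back(field);       // unquoted delimiter ends the field
              field.clear();
          } else {
              field += c;
          }
      }
      fields.push_back(field);               // last field
      return fields;
  }

With something like that in place, a quoted name containing commas comes back as a single field.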

Keep in mind that Excel has different settings depending on the system and system language. It might be that their Excel parses the file correctly, but somewhere else it isn't parsed the same way.

For example, in countries like Germany ';' is used to separate the columns. The decimal separators differ as well.

1.5 << english

1,5 << german

The same goes for the thousands separator.

1,000,000 << english

1.000.000 << german

or

1 000 000 << also german
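
To see the effect in code: the same text parses to different numbers depending on the imbued locale (a small sketch; locale names are platform dependent and std::locale throws if the named locale isn't installed):

  #include <iostream>
  #include <locale>
  #include <sstream>
  using namespace std;

  int main() {
      double value = 0.0;

      istringstream english("1.5");
      english.imbue(locale("C"));            // '.' is the decimal separator
      english >> value;
      cout << value << '\n';                 // prints 1.5

      istringstream german("1,5");
      german.imbue(locale("de_DE.UTF-8"));   // ',' is the decimal separator; name varies per OS
      german >> value;
      cout << value << '\n';                 // also prints 1.5
      return 0;
  }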

Now, Excel also has different CSV export settings, like .csv (Separated values), .csv (Macintosh) and .csv (MS-DOS), so I guess there can be differences there too.

Now for your questions: in my opinion they are not clearly wrong in what they are doing with their files. But you should think about discussing an (E)BNF with them. Here are some links:

BNF EBNF

It is a grammar that you decide on, and with clear definitions the code should be no problem. I know customers can block something like this because they don't want the extra work, but it is simply the best solution. If you want '"' in your file, they should provide it somehow. I don't know how they copy their data, but it should also be some kind of program (I don't think they do this by hand?), so your code and their code should use the same (E)BNF, which you decide on together with them.

Upvotes: 1
