nullByteMe

Reputation: 6391

Parsing a csv with comma in field

I'm trying to create an object using a csv with the below data

Alonso,Fernando,21,31,29,2,Racing
Dhoni,Mahendra Singh,22,30,4,26,Cricket
Wade,Dwyane,23,29.9,18.9,11,Basketball
Anthony,Carmelo,24,29.4,21.4,8,Basketball
Klitschko,Wladimir,25,28,24,4,Boxing
Manning,Peyton,26,27.1,15.1,12,Football
Stoudemire,Amar'e,27,26.7,21.7,5,Basketball
"Earnhardt, Jr.",Dale,28,25.9,14.9,11,Racing
Howard,Dwight,29,25.5,20.5,5,Basketball
Lee,Cliff,30,25.3,25.1,0.2,Baseball
Mauer,Joe,31,24.8,23,1.8,Baseball
Cabrera,Miguel,32,24.6,22.6,2,Baseball
Greinke,Zack,33,24.5,24.4,50,Baseball
Sharapova,Maria,34,24.4,2.4,22,Tennis
Jeter,Derek,35,24.3,15.3,9,Baseball

I'm using the following code to parse it:

void AthleteDatabase::createDatabase(void)
{
    ifstream inFile(INPUT_FILE.c_str());
    string inputString;

    if(!inFile)
    {
        cout << "Error opening file for input: " << INPUT_FILE << endl;
    }
    else
    {
        getline(inFile, inputString);
        while(inFile)
        {
            istringstream s(inputString);
            string field;
            string athleteArray[7];
            int counter = 0;
            while(getline(s, field, ','))
            {
                athleteArray[counter] = field;
                counter++;
            }

            string lastName = athleteArray[0];
            string firstName = athleteArray[1];
            int rank = atoi(athleteArray[2].c_str());
            float totalEarnings = strtof(athleteArray[3].c_str(), NULL);
            float salary = strtof(athleteArray[4].c_str(), NULL);
            float endorsements = strtof(athleteArray[5].c_str(), NULL);
            string sport = athleteArray[6];

            Athlete anAthlete(lastName, firstName, rank,
                              totalEarnings, salary, endorsements, sport);
            athleteDatabaseBST.add(anAthlete);
            display(anAthlete);
            getline(inFile, inputString);
        }
        inFile.close();
    }
}

My code breaks on the line:

"Earnhardt, Jr.",Dale,28,25.9,14.9,11,Racing

obviously because of the quotes. Is there a better way to handle this? I'm still extremely new to C++ so any assistance would be greatly appreciated.

Upvotes: 1

Views: 3801

Answers (4)

Ilmari Karonen

Reputation: 50368

I'd recommend just using a proper CSV parser. You can find some in the answers to this earlier question, or just search for one on Google.

If you insist on rolling your own, it's probably easiest to just get down to the basics and design it as a finite state machine that processes the input one character at a time. With a one-character look-ahead, you basically need two states: "reading normal input" and "reading a quoted string". If you don't want to use look-ahead, you can do this with a couple more states, e.g. like this:

  • initial state: If next character is a quote, switch to state quoted field; else behave as if in state unquoted field.

  • unquoted field: If next character is EOF, end parsing; else, if it is a newline, start a new row and switch to initial state; else, if it is a separator (comma), start a new field in the same row and switch to initial state; else append the character to the current field and remain in state unquoted field. (Optionally, if the character is a quote, signal a parse error.)

  • quoted field: If next character is EOF, signal parse error; else, if it is a quote, switch to state end quote; else append the character to the current field and remain in state quoted field.

  • end quote: If next character is a quote, append it to the current field and return to state quoted field; else, if it is a comma or a newline or EOF, behave as if in state unquoted field; else signal parse error.

(This is for "traditional" CSV, as described e.g. in RFC 4180, where quotes in quoted fields are escaped by doubling them. Adding support for backslash-escapes, which are used in some fairly common variants of the CSV format, is left as an exercise. It requires one or two more states, depending on whether you want to support backslashes in quoted or unquoted strings or both, and whether you want to support both traditional and backslash escapes at the same time.)
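As a rough illustration, the four states above could be sketched like this. The function name parseCsv and the State enum are made up for this example, and it assumes '\n' line endings (a '\r' would end up in the field). It also skips the optional error for a quote inside an unquoted field and just keeps the character.

```cpp
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not from any library): reads RFC 4180 style CSV
// from `in` one character at a time and returns rows of fields.
// Throws std::runtime_error on malformed quoting.
std::vector<std::vector<std::string>> parseCsv(std::istream& in)
{
    enum class State { Initial, Unquoted, Quoted, EndQuote };

    std::vector<std::vector<std::string>> rows;
    std::vector<std::string> row;
    std::string field;
    State state = State::Initial;
    char c;

    while (in.get(c)) {
        if (state == State::Quoted) {
            if (c == '"') state = State::EndQuote;  // maybe the closing quote
            else field += c;                        // commas and newlines are data here
            continue;
        }
        if (state == State::EndQuote && c == '"') {
            field += '"';                           // doubled quote = literal quote
            state = State::Quoted;
            continue;
        }
        if (state == State::Initial && c == '"') {
            state = State::Quoted;                  // opening quote, not stored
            continue;
        }

        // From here on Initial, Unquoted and EndQuote behave the same,
        // except that EndQuote rejects ordinary characters.
        if (c == ',') {
            row.push_back(field);
            field.clear();
            state = State::Initial;
        } else if (c == '\n') {
            row.push_back(field);
            field.clear();
            rows.push_back(row);
            row.clear();
            state = State::Initial;
        } else if (state == State::EndQuote) {
            throw std::runtime_error("unexpected character after closing quote");
        } else {
            field += c;
            state = State::Unquoted;
        }
    }

    // End of input.
    if (state == State::Quoted)
        throw std::runtime_error("unterminated quoted field");
    if (state != State::Initial || !row.empty()) {
        row.push_back(field);                       // last row had no trailing newline
        rows.push_back(row);
    }
    return rows;
}
```

You would pass it the already opened ifstream (or a std::istringstream for testing) and then convert each row's fields with atoi/strtof as in the question. Note that a completely blank line comes back as a row containing one empty field.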

In a high-level scripting language, such character-by-character iteration would be really inefficient, but since you're writing C++, all it needs to be blazing fast is some half-decent I/O buffering and a reasonably efficient string append operation.

Upvotes: 3

Eric Lease

Reputation: 4194

I agree with Ilmari's answer: why reinvent the wheel? That being said, have you considered regex? I believe this answer can be used to accomplish what you want and then some.

Upvotes: 0

Barry

Reputation: 303337

Simple answer: use a different delimiter. Everything's a lot easier to parse if you use something like '|' instead:

Stoudemire|Amar'e|27|26.7|21.7|5|Basketball
Earnhardt, Jr.|Dale|28|25.9|14.9|11|Racing

The advantage there being any other app that might need to parse your file can also do it just as cleanly.
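With a delimiter that never appears in the data, the splitting really is just a getline loop. A minimal sketch, where splitOnDelimiter is a made-up helper name and the '|' default is only an assumption about how you would rewrite the file:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits one line on a single delimiter character.
// There is no quote handling at all, so the delimiter must never occur
// inside a field. A trailing empty field ("a|b|") is not reported.
std::vector<std::string> splitOnDelimiter(const std::string& line, char delim = '|')
{
    std::vector<std::string> fields;
    std::istringstream s(line);
    std::string field;
    while (std::getline(s, field, delim)) {
        fields.push_back(field);
    }
    return fields;
}
```

For the Earnhardt line above this gives seven fields, with "Earnhardt, Jr." intact as the first one.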

If sticking with commas is a requirement, then you'd have to conditionally grab a field based on its first char:

std::istream& nextField(std::istringstream& s, std::string& field)
{
    char c;
    if (s >> c) {
        if (c == '"') {
            // using " as the delimiter
            getline(s, field, '"');
            // consume the subsequent comma, if any; a quoted field
            // at the end of the line has none
            if (s.peek() == ',') s.ignore();
            return s;
        }
        else if (c == ',') {
            // handle empty field case
            field = "";
        }
        else {
            // normal case, but prepend c
            getline(s, field, ',');
            field = c + field;
        }
    }

    return s;
}

Used as a substitute for where you have getline:

while (nextField(s, field)) {
    athleteVec.push_back(field); // prefer vector to array
}

Could even simplify that logic a bit by just continuing to use getline if we have an unterminated quoted string:

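std::istream& nextField(std::istringstream& s, std::string& field)
{
    if (std::getline(s, field, ',')) {
        // a field that opens with a quote but does not yet end with one
        // was cut at a comma inside the quotes, so keep appending
        // (the emptiness and size checks guard against empty fields)
        while (s && !field.empty() && field[0] == '"'
               && (field.size() == 1 || field[field.size() - 1] != '"')) {
            std::string next;
            std::getline(s, next, ',');
            field += ',' + next;
        }

        // strip the surrounding quotes
        if (field.size() >= 2 && field[0] == '"' && field[field.size() - 1] == '"') {
            field = field.substr(1, field.size() - 2);
        }
    }

    return s;
}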

Upvotes: 1

Sam Varshavchik

Reputation: 118425

You have to parse each line character by character, using a bool flag and a std::string that accumulates the contents of the next field, instead of just plowing ahead to the next comma, as you did.

Initially, the bool flag is false, and you iterate over the entire line, character by character. A quote character flips the bool flag. A comma, but only when the bool flag is false, ends the current field: the accumulated contents of the std::string are saved as the next field on the line, and the std::string is cleared, ready for the next field. Any other character (including a comma seen while the flag is true) gets appended to the buffer.

This is a basic outline of the algorithm, with some minor details that you should be able to flesh out by yourself. There are a couple of other ways to do this, that are slightly more efficient, but for a beginner like yourself this kind of an approach would be the easiest to implement.
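One possible way to flesh that outline out is shown below. The function name splitCsvLine is made up for this example, and the sketch makes no attempt to handle doubled quotes ("") inside a quoted field; they are simply dropped.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits one CSV line into fields, treating commas
// inside double quotes as ordinary characters.
std::vector<std::string> splitCsvLine(const std::string& line)
{
    std::vector<std::string> fields;
    std::string current;      // accumulates the field being read
    bool inQuotes = false;    // true while between a pair of quotes

    for (char c : line) {
        if (c == '"') {
            inQuotes = !inQuotes;          // flip the flag, do not store the quote
        } else if (c == ',' && !inQuotes) {
            fields.push_back(current);     // comma outside quotes ends the field
            current.clear();
        } else {
            current += c;                  // everything else is field content
        }
    }
    fields.push_back(current);             // the last field has no trailing comma
    return fields;
}
```

In the question's loop this would replace the inner while(getline(s, field, ',')): call splitCsvLine(inputString) and check that the result has exactly seven fields before converting them.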

Upvotes: 1
