Jonnster

Reputation: 3234

Splitting a line of a csv file into a std::vector?

I have a function that will read a CSV file line by line. For each line, it will split the line into a vector. The code to do this is

    std::stringstream ss(sText);
    std::string item;

    while(std::getline(ss, item, ','))
    {
        m_vecFields.push_back(item);
    }

This works fine unless the line ends with a blank value. For example,

text1,tex2,

I would want this to return a vector of size 3 where the third value is just empty. However, instead it just returns a vector of size 2. How can I correct this?

Upvotes: 5

Views: 15281

Answers (5)

Jonathan Mee

Reputation: 38949

C++11 makes it exceedingly easy to handle even escaped commas using regex_token_iterator:

std::stringstream ss(sText);
std::string item;
const regex re{"((?:[^\\\\,]|\\\\.)*?)(?:,|$)"};

std::getline(ss, item);

m_vecFields.insert(m_vecFields.end(), sregex_token_iterator(item.begin(), item.end(), re, 1), sregex_token_iterator());

Incidentally if you simply wanted to construct a vector<string> from a CSV string such as item you could just do:

const regex re{"((?:[^\\\\,]|\\\\.)*?)(?:,|$)"};
vector<string> m_vecFields{sregex_token_iterator(item.begin(), item.end(), re, 1), sregex_token_iterator()};

Some quick explanation of the regex is probably in order. (?:[^\\\\,]|\\\\.) matches escaped characters or non-',' characters. (See here for more info: https://stackoverflow.com/a/7902016/2642059) The *? means that it is not a greedy match, so it will stop at the first ',' reached. All that's nested in a capture, which is selected by the last parameter, the 1, to regex_token_iterator. Finally, (?:,|$) will match either the ','-delimiter or the end of the string.

To make this standard CSV reader ignore empty elements, the regex can be altered to only match strings with at least one character.

const regex re{"((?:[^\\\\,]|\\\\.)+?)(?:,|$)"};

Notice the '+' has now replaced the '*' indicating 1 or more matching characters are required. This will prevent it from matching your item string that ends with a ','. You can see an example of this here: http://ideone.com/W4n44W
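
To compare the two patterns side by side, here is a minimal sketch. The helper `split_fields` and the names `keep_empty` and `skip_empty` are my own for illustration and are not part of the snippets above; the regexes themselves are copied unchanged.

```cpp
#include <regex>
#include <string>
#include <vector>

// Keeps empty fields: capture group 1 may match zero characters.
const std::regex keep_empty{"((?:[^\\\\,]|\\\\.)*?)(?:,|$)"};
// Skips empty fields: capture group 1 must match at least one character.
const std::regex skip_empty{"((?:[^\\\\,]|\\\\.)+?)(?:,|$)"};

// Hypothetical helper: collects capture group 1 of every match of `re` in `line`.
std::vector<std::string> split_fields(const std::string& line, const std::regex& re)
{
    return std::vector<std::string>(
        std::sregex_token_iterator(line.begin(), line.end(), re, 1),
        std::sregex_token_iterator());
}
```

With the question's input `text1,tex2,`, `split_fields(line, keep_empty)` should give three fields with an empty third one, while `split_fields(line, skip_empty)` should give only the two non-empty fields.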

Upvotes: 2

Markoj

Reputation: 243

A flexible solution for parsing CSV files, where:

source - the content of one CSV line

delimeter - the CSV delimiter, e.g. ',' or ';'

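std::vector<std::string> csv_split(const std::string& source, char delimeter) {
    std::vector<std::string> ret;
    std::string word;

    bool inQuote = false;
    for (std::size_t i = 0; i < source.size(); ++i) {
        // An opening quote starts a quoted field; the quote itself is not stored.
        if (!inQuote && source[i] == '"') {
            inQuote = true;
            continue;
        }
        if (inQuote && source[i] == '"') {
            // A doubled quote ("") inside a quoted field is an escaped quote character.
            if (i + 1 < source.size() && source[i + 1] == '"') {
                ++i;
            } else {
                inQuote = false;
                continue;
            }
        }

        // A delimiter outside quotes ends the current field.
        if (!inQuote && source[i] == delimeter) {
            ret.push_back(word);
            word.clear();
        } else {
            word += source[i];
        }
    }
    // The last field has no trailing delimiter, so push it here (it may be empty).
    ret.push_back(word);

    return ret;
}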

Upvotes: 2

GrahamS

Reputation: 10350

You could just use boost::split to do all this for you.
http://www.boost.org/doc/libs/1_50_0/doc/html/string_algo/usage.html#id3207193

It has the behaviour that you require in one line.

Example boost::split Code

#include <iostream>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>

using namespace std;

int main()
{
    vector<string> strs;

    boost::split(strs, "please split,this,csv,,line,", boost::is_any_of(","));

    for ( vector<string>::iterator it = strs.begin(); it < strs.end(); it++ )
        cout << "\"" << *it << "\"" << endl;

    return 0;
}

Results

"please split"
"this"
"csv"
""
"line"
""

Upvotes: 4

Julien Lebot

Reputation: 3092

You can use a function similar to this:

template <class InIt, class OutIt>
void Split(InIt begin, InIt end, OutIt splits)
{
    InIt current = begin;
    while (begin != end)
    {
        if (*begin == ',')
        {
            *splits++ = std::string(current,begin);
            current = ++begin;
        }
        else
            ++begin;
    }
    *splits++ = std::string(current,begin);
}

It will iterate through the string and whenever it encounters the delimiter, it will extract the string and store it in the splits iterator.
The interesting part is

  • when current == begin it will insert an empty string (test case: "text1,,tex2")
  • the last insertion guarantees there will always be the correct number of elements.
    If there is a trailing comma, it will trigger the previous bullet point and add an empty string, otherwise it will add the last element to the vector.

You can use it like this:

std::stringstream ss(sText);
std::string item;
std::vector<std::string> m_vecFields;
while(std::getline(ss, item))
{
    Split(item.begin(), item.end(), std::back_inserter(m_vecFields));
}

std::for_each(m_vecFields.begin(), m_vecFields.end(), [](std::string& value)
{
    std::cout << value << std::endl;
});

Upvotes: 2

jrok

Reputation: 55425

bool addEmptyLine = !sText.empty() && sText.back() == ',';

/* your code here */

if (addEmptyLine) m_vecFields.push_back("");

or

sText += ',';     // "text1,tex2," becomes "text1,tex2,,"

/* your code */

assert(m_vecFields.size() == 3);
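
Put together, the first variant might look like the sketch below. The function name `split_line` and the local `fields` vector are placeholders of mine, standing in for the question's member `m_vecFields`; the loop is the one from the question.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: the question's getline loop plus a trailing-comma check.
std::vector<std::string> split_line(const std::string& sText)
{
    std::vector<std::string> fields;
    std::stringstream ss(sText);
    std::string item;

    while (std::getline(ss, item, ','))
        fields.push_back(item);

    // After a trailing ',' getline extracts nothing and fails, so the loop
    // never sees the final empty field; add it by hand.
    if (!sText.empty() && sText.back() == ',')
        fields.push_back("");

    return fields;
}
```

The `empty()` check guards the call to `back()`, which is undefined on an empty string. Note that in this sketch an empty line yields zero fields.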

Upvotes: 2
