Less White

Reputation: 589

Split a line using std::regex and discard empty elements

I need to split a line on two separators: a space (' ') and a semicolon (;).

For example:

input : " abc  ; def  hij  klm  "
output: {"abc","def","hij","klm"}

How can I fix the function below to discard the first empty element?

std::vector<std::string> Split(std::string const& line) {
    std::regex seps("[ ;]+");
    std::sregex_token_iterator rit(line.begin(), line.end(), seps, -1);
    return std::vector<std::string>(rit, std::sregex_token_iterator());
}

// input : " abc  ; def  hij  klm  "
// output: {"","abc","def","hij","klm"}

Below is a complete sample that compiles:

#include <iostream>
#include <string>
#include <vector>
#include <regex>

std::vector<std::string> Split(std::string const& line) {
    std::regex seps("[ ;]+");
    std::sregex_token_iterator rit(line.begin(), line.end(), seps, -1);
    return std::vector<std::string>(rit, std::sregex_token_iterator());
}

int main()
{
    std::string line = " abc  ; def  hij  klm  ";
    std::cout << "input: \"" << line << "\"" << std::endl;

    auto collection = Split(line);

    std::cout << "output: {";
    auto bComma = false;
    for (auto oneField : collection)
    {
        std::cout << (bComma ? "," : "") << "\"" << oneField << "\"";
        bComma = true;
    }
    std::cout << "} " << std::endl;
}

Upvotes: 4

Views: 695

Answers (4)

Less White

Reputation: 589

In case someone wants to copy it, here is the function revised along the lines of Jerry Coffin's suggestion, using std::remove_copy_if:

// Requires <algorithm>, <iterator>, <regex>, <string> and <vector>.
std::vector<std::string> SplitLine(std::string const& line, std::regex const& seps)
{
    // Submatch index -1 selects the text between separator matches.
    std::sregex_token_iterator rit(line.begin(), line.end(), seps, -1);
    std::vector<std::string> tokens;
    // Copy every token except the empty ones.
    std::remove_copy_if(rit, std::sregex_token_iterator(),
        std::back_inserter(tokens),
        [](std::string const &s) { return s.empty(); });
    return tokens;
}

Upvotes: 0

Jerry Coffin

Reputation: 490098

I can see a couple of possibilities beyond what's been mentioned in the other answers so far. The first would be to use std::remove_copy_if when building your vector:

// regex stuff here
std::vector<std::string> tokens;
std::remove_copy_if(rit, std::sregex_token_iterator(), 
                    std::back_inserter(tokens),
                    [](std::string const &s) { return s.empty(); });

Another possibility would be to create a locale that classifies the separator characters as whitespace, and read the tokens from a stream that uses it:

struct reader: std::ctype<char> {
    reader(): std::ctype<char>(get_table()) {}
    static std::ctype_base::mask const* get_table() {
        static std::vector<std::ctype_base::mask> rc(table_size, std::ctype_base::mask());

        rc[' '] = std::ctype_base::space;
        rc[';'] = std::ctype_base::space;

        // at a guess, newlines are probably still separators too:
        rc['\n'] = std::ctype_base::space;
        return &rc[0];
    }
};

Once we have this, we tell the stream to use that locale when reading from (or writing to) the stream:

std::stringstream input(" abc  ; def  hij  klm  ");

input.imbue(std::locale(std::locale(), new reader));
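Putting the facet and the imbue call together, a minimal sketch that collects the tokens into a vector (rather than printing them) might look like the following. The names separator_ctype and SplitWithLocale are made up for this illustration, and the facet simply restates the one above under a different name.

```cpp
#include <iterator>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Illustrative facet: only ' ', ';' and '\n' count as whitespace.
struct separator_ctype : std::ctype<char> {
    separator_ctype() : std::ctype<char>(get_table()) {}

    static std::ctype_base::mask const* get_table() {
        // Start with every character unclassified, then mark the separators.
        static std::vector<std::ctype_base::mask> table(table_size,
                                                        std::ctype_base::mask());
        table[' ']  = std::ctype_base::space;
        table[';']  = std::ctype_base::space;
        table['\n'] = std::ctype_base::space;
        return table.data();
    }
};

// Hypothetical helper: split a line by letting operator>> skip the separators.
std::vector<std::string> SplitWithLocale(std::string const& line) {
    std::istringstream input(line);
    // The locale takes ownership of the facet allocated with new.
    input.imbue(std::locale(std::locale(), new separator_ctype));

    std::istream_iterator<std::string> first(input), last;
    return std::vector<std::string>(first, last);
}
```

Because formatted extraction skips leading whitespace and never yields an empty string, no empty tokens need to be filtered out afterwards.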

Then we probably want to clean up the code for inserting commas only between tokens, rather than after every token. Fortunately, I wrote some code to handle that fairly neatly some time ago. Using it, we can copy tokens from the input above to standard output fairly simply:

std::cout << "{ ";
std::copy(std::istream_iterator<std::string>(input), {}, 
    infix_ostream_iterator<std::string>(std::cout, ", "));  
std::cout << " }";

Result: "{ abc, def, hij, klm }", exactly as you'd expect/hope for--without any extra kludges to make up for its starting out doing the wrong thing.
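Note that infix_ostream_iterator is not part of the standard library. The sketch below is only a guess at what such a helper could look like, not the code referred to above: an output iterator that writes the delimiter before every item except the first. JoinTokens is a hypothetical wrapper added here just to show it in use.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Assumed stand-in for the helper used above: writes the delimiter
// between items instead of after each one.
template <class T>
class infix_ostream_iterator {
public:
    using iterator_category = std::output_iterator_tag;
    using value_type        = void;
    using difference_type   = std::ptrdiff_t;
    using pointer           = void;
    using reference         = void;

    infix_ostream_iterator(std::ostream& os, char const* delimiter)
        : os_(&os), delimiter_(delimiter), first_(true) {}

    // Assignment does the actual output.
    infix_ostream_iterator& operator=(T const& item) {
        if (!first_)
            *os_ << delimiter_;
        *os_ << item;
        first_ = false;
        return *this;
    }

    // Dereference and increment are no-ops, as with std::ostream_iterator.
    infix_ostream_iterator& operator*() { return *this; }
    infix_ostream_iterator& operator++() { return *this; }
    infix_ostream_iterator& operator++(int) { return *this; }

private:
    std::ostream* os_;
    char const*   delimiter_;
    bool          first_;
};

// Hypothetical usage helper: join tokens into a single string.
std::string JoinTokens(std::vector<std::string> const& tokens,
                       char const* delimiter) {
    std::ostringstream out;
    std::copy(tokens.begin(), tokens.end(),
              infix_ostream_iterator<std::string>(out, delimiter));
    return out.str();
}
```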

Upvotes: 3

NathanOliver

Reputation: 180490

If you do not want to remove elements from the vector after populating it, you can instead traverse the iterator range and build the vector while skipping the empty matches:

std::vector<std::string> Split(std::string const& line) {
    std::regex seps("[ ;]+");
    std::sregex_token_iterator rit(line.begin(), line.end(), seps, -1), end;
    std::vector<std::string> tokens;
    for (; rit != end; ++rit) {
        // Skip the empty token produced when the line starts with a separator.
        if (rit->length() != 0)
            tokens.push_back(*rit);
    }
    return tokens;
}
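A related variation, which is a different technique from the one above, is to avoid producing empty matches in the first place by matching the tokens themselves instead of the separators. A minimal sketch, assuming a token is any run of characters that are neither ' ' nor ';' (SplitByTokens is a name made up for this example):

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical alternative: iterate over the tokens, not the gaps between them.
std::vector<std::string> SplitByTokens(std::string const& line) {
    // One or more characters that are neither a space nor a semicolon.
    static std::regex const token("[^ ;]+");

    // The default submatch index 0 yields each whole match.
    std::sregex_token_iterator first(line.begin(), line.end(), token), last;
    return std::vector<std::string>(first, last);
}
```

Since every match contains at least one character, leading or trailing separators cannot contribute empty elements.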

Upvotes: 1

Cory Kramer

Reputation: 117856

You could always add an extra step at the end of the function to prune the empty strings, using the erase-remove idiom:

std::vector<std::string> Split(std::string const& line) {
    std::regex seps("[ ;]+");
    std::sregex_token_iterator rit(line.begin(), line.end(), seps, -1);
    auto tokens = std::vector<std::string>(rit, std::sregex_token_iterator());
    tokens.erase(std::remove_if(tokens.begin(),
                                tokens.end(),
                                [](std::string const& s){ return s.empty(); }),
                 tokens.end());
    return tokens;
}

Upvotes: 2
