Bill the Lizard
Bill the Lizard

Reputation: 405675

How do I tokenize a string in C++?

Java has a convenient split method:

String str = "The quick brown fox";
String[] results = str.split(" ");

Is there an easy way to do this in C++?

Upvotes: 480

Views: 670421

Answers (30)

leanid.chaika
leanid.chaika

Reputation: 2432

I just read all the answers and can't find solution with next preconditions:

  1. no dynamic memory allocations
  2. no use of boost
  3. no use of regex
  4. c++17 standard only

So here is my solution

#include <iomanip>
#include <iostream>
#include <iterator>
#include <string_view>
#include <utility>

struct split_by_spaces
{
    std::string_view      text;
    static constexpr char delim = ' ';

    struct iterator
    {
        const std::string_view& text;
        std::size_t             cur_pos;
        std::size_t             end_pos;

        std::string_view operator*() const
        {
            return { &text[cur_pos], end_pos - cur_pos };
        }
        bool operator==(const iterator& other) const
        {
            return cur_pos == other.cur_pos && end_pos == other.end_pos;
        }
        bool operator!=(const iterator& other) const
        {
            return !(*this == other);
        }
        iterator& operator++()
        {
            cur_pos = text.find_first_not_of(delim, end_pos);

            if (cur_pos == std::string_view::npos)
            {
                cur_pos = text.size();
                end_pos = cur_pos;
                return *this;
            }

            end_pos = text.find(delim, cur_pos);

            if (end_pos == std::string_view::npos)
            {
                end_pos = text.size();
            }

            return *this;
        }
    };

    [[nodiscard]] iterator begin() const
    {
        auto start = text.find_first_not_of(delim);
        if (start == std::string_view::npos)
        {
            return iterator{ text, text.size(), text.size() };
        }
        auto end_word = text.find(delim, start);
        if (end_word == std::string_view::npos)
        {
            end_word = text.size();
        }
        return iterator{ text, start, end_word };
    }
    [[nodiscard]] iterator end() const
    {
        return iterator{ text, text.size(), text.size() };
    }
};

int main(int argc, char** argv)
{
    using namespace std::literals;
    auto str = " there should be no memory allocation during parsing"
               "  into words this line and you   should'n create any"
               "  contaner                  for intermediate words  "sv;

    auto comma = "";
    for (std::string_view word : split_by_spaces{ str })
    {
        std::cout << std::exchange(comma, ",") << std::quoted(word);
    }

    auto only_spaces = "                   "sv;
    for (std::string_view word : split_by_spaces{ only_spaces })
    {
        std::cout << "you will not see this line in output" << std::endl;
    }
}

Upvotes: 1

Adam Pierce
Adam Pierce

Reputation: 34345

Here's a real simple one:

#include <vector>
#include <string>

vector<string> split(const char *str, char c = ' ')
{
    std::vector<std::string> result;

    do
    {
        const char *begin = str;

        while(*str != c && *str)
            str++;

        result.push_back(std::string(begin, str));
    } while (0 != *str++);

    return result;
}

Upvotes: 185

user35978
user35978

Reputation: 2362

Another quick way is to use getline. Something like:

std::istringstream iss(str);
std::string s;

while (std::getline(iss, s, ' ')) {
  std::cout << s << std::endl;
}

If you want, you can make a simple split() method returning a std::vector<string>, which is really useful.

Upvotes: 151

Tanzer
Tanzer

Reputation: 320

I wrote a simplified version (and maybe a little bit efficient) of https://stackoverflow.com/a/50247503/3976739 for my own use. I hope it would help.

void StrTokenizer(string& source, const char* delimiter, vector<string>& Tokens)
{   
   size_t new_index = 0;
   size_t old_index = 0;

   while (new_index != std::string::npos)   
   {
      new_index = source.find(delimiter, old_index);
      Tokens.emplace_back(source.substr(old_index, new_index-old_index));

      if (new_index != std::string::npos)
          old_index = ++new_index;
   }
}

Upvotes: 1

einpoklum
einpoklum

Reputation: 131385

If you're using C++ ranges - the full ranges-v3 library, not the limited functionality accepted into C++20 - you could do it this way:

auto results = str | ranges::views::tokenize(" ",1);

... and this is lazily-evaluated. You can alternatively set a vector to this range:

auto results = str | ranges::views::tokenize(" ",1) | ranges::to<std::vector>();

this will take O(m) space and O(n) time if str has n characters making up m words.

See also the library's own tokenization example, here.

Upvotes: 11

Jonathan Mee
Jonathan Mee

Reputation: 38909

Adam Pierce's answer provides an hand-spun tokenizer taking in a const char*. It's a bit more problematic to do with iterators because incrementing a string's end iterator is undefined. That said, given string str{ "The quick brown fox" } we can certainly accomplish this:

auto start = find(cbegin(str), cend(str), ' ');
vector<string> tokens{ string(cbegin(str), start) };

while (start != cend(str)) {
    const auto finish = find(++start, cend(str), ' ');

    tokens.push_back(string(start, finish));
    start = finish;
}

Live Example


If you're looking to abstract complexity by using standard functionality, as On Freund suggests strtok is a simple option:

vector<string> tokens;

for (auto i = strtok(data(str), " "); i != nullptr; i = strtok(nullptr, " ")) tokens.push_back(i);

If you don't have access to C++17 you'll need to substitute data(str) as in this example: http://ideone.com/8kAGoa

Though not demonstrated in the example, strtok need not use the same delimiter for each token. Along with this advantage though, there are several drawbacks:

  1. strtok cannot be used on multiple strings at the same time: Either a nullptr must be passed to continue tokenizing the current string or a new char* to tokenize must be passed (there are some non-standard implementations which do support this however, such as: strtok_s)
  2. For the same reason strtok cannot be used on multiple threads simultaneously (this may however be implementation defined, for example: Visual Studio's implementation is thread safe)
  3. Calling strtok modifies the string it is operating on, so it cannot be used on const strings, const char*s, or literal strings, to tokenize any of these with strtok or to operate on a string who's contents need to be preserved, str would have to be copied, then the copy could be operated on

provides us with split_view to tokenize strings, in a non-destructive manner: https://topanswers.xyz/cplusplus?q=749#a874


The previous methods cannot generate a tokenized vector in-place, meaning without abstracting them into a helper function they cannot initialize const vector<string> tokens. That functionality and the ability to accept any white-space delimiter can be harnessed using an istream_iterator. For example given: const string str{ "The quick \tbrown \nfox" } we can do this:

istringstream is{ str };
const vector<string> tokens{ istream_iterator<string>(is), istream_iterator<string>() };

Live Example

The required construction of an istringstream for this option has far greater cost than the previous 2 options, however this cost is typically hidden in the expense of string allocation.


If none of the above options are flexable enough for your tokenization needs, the most flexible option is using a regex_token_iterator of course with this flexibility comes greater expense, but again this is likely hidden in the string allocation cost. Say for example we want to tokenize based on non-escaped commas, also eating white-space, given the following input: const string str{ "The ,qu\\,ick ,\tbrown, fox" } we can do this:

const regex re{ "\\s*((?:[^\\\\,]|\\\\.)*?)\\s*(?:,|$)" };
const vector<string> tokens{ sregex_token_iterator(cbegin(str), cend(str), re, 1), sregex_token_iterator() };

Live Example

Upvotes: 9

Konrad Rudolph
Konrad Rudolph

Reputation: 545478

C++ standard library algorithms are pretty universally based around iterators rather than concrete containers. Unfortunately this makes it hard to provide a Java-like split function in the C++ standard library, even though nobody argues that this would be convenient. But what would its return type be? std::vector<std::basic_string<…>>? Maybe, but then we’re forced to perform (potentially redundant and costly) allocations.

Instead, C++ offers a plethora of ways to split strings based on arbitrarily complex delimiters, but none of them is encapsulated as nicely as in other languages. The numerous ways fill whole blog posts.

At its simplest, you could iterate using std::string::find until you hit std::string::npos, and extract the contents using std::string::substr.

A more fluid (and idiomatic, but basic) version for splitting on whitespace would use a std::istringstream:

auto iss = std::istringstream{"The quick brown fox"};
auto str = std::string{};

while (iss >> str) {
    process(str);
}

Using std::istream_iterators, the contents of the string stream could also be copied into a vector using its iterator range constructor.

Multiple libraries (such as Boost.Tokenizer) offer specific tokenisers.

More advanced splitting require regular expressions. C++ provides the std::regex_token_iterator for this purpose in particular:

auto const str = "The quick brown fox"s;
auto const re = std::regex{R"(\s+)"};
auto const vec = std::vector<std::string>(
    std::sregex_token_iterator{begin(str), end(str), re, -1},
    std::sregex_token_iterator{}
);

Upvotes: 178

NutCracker
NutCracker

Reputation: 12253

I know this question is already answered but I want to contribute. Maybe my solution is a bit simple but this is what I came up with:

vector<string> get_words(string const& text, string const& separator)
{
    vector<string> result;
    string tmp = text;

    size_t first_pos = 0;
    size_t second_pos = tmp.find(separator);

    while (second_pos != string::npos)
    {
        if (first_pos != second_pos)
        {
            string word = tmp.substr(first_pos, second_pos - first_pos);
            result.push_back(word);
        }
        tmp = tmp.substr(second_pos + separator.length());
        second_pos = tmp.find(separator);
    }

    result.push_back(tmp);

    return result;
}

Please comment if there is a better approach to something in my code or if something is wrong.

UPDATE: added generic separator

Upvotes: 3

DannyK
DannyK

Reputation: 1552

I posted this answer for similar question.
Don't reinvent the wheel. I've used a number of libraries and the fastest and most flexible I have come across is: C++ String Toolkit Library.

Here is an example of how to use it that I've posted else where on the stackoverflow.

#include <iostream>
#include <vector>
#include <string>
#include <strtk.hpp>

const char *whitespace  = " \t\r\n\f";
const char *whitespace_and_punctuation  = " \t\r\n\f;,=";

int main()
{
    {   // normal parsing of a string into a vector of strings
       std::string s("Somewhere down the road");
       std::vector<std::string> result;
       if( strtk::parse( s, whitespace, result ) )
       {
           for(size_t i = 0; i < result.size(); ++i )
            std::cout << result[i] << std::endl;
       }
    }

    {  // parsing a string into a vector of floats with other separators
       // besides spaces

       std::string s("3.0, 3.14; 4.0");
       std::vector<float> values;
       if( strtk::parse( s, whitespace_and_punctuation, values ) )
       {
           for(size_t i = 0; i < values.size(); ++i )
            std::cout << values[i] << std::endl;
       }
    }

    {  // parsing a string into specific variables

       std::string s("angle = 45; radius = 9.9");
       std::string w1, w2;
       float v1, v2;
       if( strtk::parse( s, whitespace_and_punctuation, w1, v1, w2, v2) )
       {
           std::cout << "word " << w1 << ", value " << v1 << std::endl;
           std::cout << "word " << w2 << ", value " << v2 << std::endl;
       }
    }

    return 0;
}

Upvotes: 10

kayleeFrye_onDeck
kayleeFrye_onDeck

Reputation: 6958

Here's my Swiss® Army Knife of string-tokenizers for splitting up strings by whitespace, accounting for single and double-quote wrapped strings as well as stripping those characters from the results. I used RegexBuddy 4.x to generate most of the code-snippet, but I added custom handling for stripping quotes and a few other things.

#include <string>
#include <locale>
#include <regex>

std::vector<std::wstring> tokenize_string(std::wstring string_to_tokenize) {
    std::vector<std::wstring> tokens;

    std::wregex re(LR"(("[^"]*"|'[^']*'|[^"' ]+))", std::regex_constants::collate);

    std::wsregex_iterator next( string_to_tokenize.begin(),
                                string_to_tokenize.end(),
                                re,
                                std::regex_constants::match_not_null );

    std::wsregex_iterator end;
    const wchar_t single_quote = L'\'';
    const wchar_t double_quote = L'\"';
    while ( next != end ) {
        std::wsmatch match = *next;
        const std::wstring token = match.str( 0 );
        next++;

        if (token.length() > 2 && (token.front() == double_quote || token.front() == single_quote))
            tokens.emplace_back( std::wstring(token.begin()+1, token.begin()+token.length()-1) );
        else
            tokens.emplace_back(token);
    }
    return tokens;
}

Upvotes: 1

sivabudh
sivabudh

Reputation: 32635

I know you asked for a C++ solution, but you might consider this helpful:

Qt

#include <QString>

...

QString str = "The quick brown fox"; 
QStringList results = str.split(" "); 

The advantage over Boost in this example is that it's a direct one to one mapping to your post's code.

See more at Qt documentation

Upvotes: 28

vzczc
vzczc

Reputation: 9380

Here is a sample tokenizer class that might do what you want

//Header file
class Tokenizer 
{
    public:
        static const std::string DELIMITERS;
        Tokenizer(const std::string& str);
        Tokenizer(const std::string& str, const std::string& delimiters);
        bool NextToken();
        bool NextToken(const std::string& delimiters);
        const std::string GetToken() const;
        void Reset();
    protected:
        size_t m_offset;
        const std::string m_string;
        std::string m_token;
        std::string m_delimiters;
};

//CPP file
const std::string Tokenizer::DELIMITERS(" \t\n\r");

Tokenizer::Tokenizer(const std::string& s) :
    m_string(s), 
    m_offset(0), 
    m_delimiters(DELIMITERS) {}

Tokenizer::Tokenizer(const std::string& s, const std::string& delimiters) :
    m_string(s), 
    m_offset(0), 
    m_delimiters(delimiters) {}

bool Tokenizer::NextToken() 
{
    return NextToken(m_delimiters);
}

bool Tokenizer::NextToken(const std::string& delimiters) 
{
    size_t i = m_string.find_first_not_of(delimiters, m_offset);
    if (std::string::npos == i) 
    {
        m_offset = m_string.length();
        return false;
    }

    size_t j = m_string.find_first_of(delimiters, i);
    if (std::string::npos == j) 
    {
        m_token = m_string.substr(i);
        m_offset = m_string.length();
        return true;
    }

    m_token = m_string.substr(i, j - i);
    m_offset = j;
    return true;
}

Example:

std::vector <std::string> v;
Tokenizer s("split this string", " ");
while (s.NextToken())
{
    v.push_back(s.GetToken());
}

Upvotes: 23

Parham
Parham

Reputation: 3472

This is a simple STL-only solution (~5 lines!) using std::find and std::find_first_not_of that handles repetitions of the delimiter (like spaces or periods for instance), as well leading and trailing delimiters:

#include <string>
#include <vector>

void tokenize(std::string str, std::vector<string> &token_v){
    size_t start = str.find_first_not_of(DELIMITER), end=start;

    while (start != std::string::npos){
        // Find next occurence of delimiter
        end = str.find(DELIMITER, start);
        // Push back the token found into vector
        token_v.push_back(str.substr(start, end-start));
        // Skip all occurences of the delimiter to find new start
        start = str.find_first_not_of(DELIMITER, end);
    }
}

Try it out live!

Upvotes: 29

CATspellsDOG
CATspellsDOG

Reputation: 109

I made a lexer/tokenizer before with the use of only standard libraries. Here's the code:

#include <iostream>
#include <string>
#include <vector>
#include <sstream>

using namespace std;

string seps(string& s) {
    if (!s.size()) return "";
    stringstream ss;
    ss << s[0];
    for (int i = 1; i < s.size(); i++) {
        ss << '|' << s[i];
    }
    return ss.str();
}

void Tokenize(string& str, vector<string>& tokens, const string& delimiters = " ")
{
    seps(str);

    // Skip delimiters at beginning.
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    // Find first "non-delimiter".
    string::size_type pos = str.find_first_of(delimiters, lastPos);

    while (string::npos != pos || string::npos != lastPos)
    {
        // Found a token, add it to the vector.
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters.  Note the "not_of"
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find next "non-delimiter"
        pos = str.find_first_of(delimiters, lastPos);
    }
}

int main(int argc, char *argv[])
{
    vector<string> t;
    string s = "Tokens for everyone!";

    Tokenize(s, t, "|");

    for (auto c : t)
        cout << c << endl;

    system("pause");

    return 0;
}

Upvotes: 0

w.b
w.b

Reputation: 11228

A solution using regex_token_iterators:

#include <iostream>
#include <regex>
#include <string>

using namespace std;

int main()
{
    string str("The quick brown fox");

    regex reg("\\s+");

    sregex_token_iterator iter(str.begin(), str.end(), reg, -1);
    sregex_token_iterator end;

    vector<string> vec(iter, end);

    for (auto a : vec)
    {
        cout << a << endl;
    }
}

Upvotes: 79

Raz
Raz

Reputation: 1932

Boost has a strong split function: boost::algorithm::split.

Sample program:

#include <vector>
#include <boost/algorithm/string.hpp>

int main() {
    auto s = "a,b, c ,,e,f,";
    std::vector<std::string> fields;
    boost::split(fields, s, boost::is_any_of(","));
    for (const auto& field : fields)
        std::cout << "\"" << field << "\"\n";
    return 0;
}

Output:

"a"
"b"
" c "
""
"e"
"f"
""

Upvotes: 39

odinthenerd
odinthenerd

Reputation: 5552

Seems odd to me that with all us speed conscious nerds here on SO no one has presented a version that uses a compile time generated look up table for the delimiter (example implementation further down). Using a look up table and iterators should beat std::regex in efficiency, if you don't need to beat regex, just use it, its standard as of C++11 and super flexible.

Some have suggested regex already but for the noobs here is a packaged example that should do exactly what the OP expects:

std::vector<std::string> split(std::string::const_iterator it, std::string::const_iterator end, std::regex e = std::regex{"\\w+"}){
    std::smatch m{};
    std::vector<std::string> ret{};
    while (std::regex_search (it,end,m,e)) {
        ret.emplace_back(m.str());              
        std::advance(it, m.position() + m.length()); //next start position = match position + match length
    }
    return ret;
}
std::vector<std::string> split(const std::string &s, std::regex e = std::regex{"\\w+"}){  //comfort version calls flexible version
    return split(s.cbegin(), s.cend(), std::move(e));
}
int main ()
{
    std::string str {"Some people, excluding those present, have been compile time constants - since puberty."};
    auto v = split(str);
    for(const auto&s:v){
        std::cout << s << std::endl;
    }
    std::cout << "crazy version:" << std::endl;
    v = split(str, std::regex{"[^e]+"});  //using e as delim shows flexibility
    for(const auto&s:v){
        std::cout << s << std::endl;
    }
    return 0;
}

If we need to be faster and accept the constraint that all chars must be 8 bits we can make a look up table at compile time using metaprogramming:

template<bool...> struct BoolSequence{};        //just here to hold bools
template<char...> struct CharSequence{};        //just here to hold chars
template<typename T, char C> struct Contains;   //generic
template<char First, char... Cs, char Match>    //not first specialization
struct Contains<CharSequence<First, Cs...>,Match> :
    Contains<CharSequence<Cs...>, Match>{};     //strip first and increase index
template<char First, char... Cs>                //is first specialization
struct Contains<CharSequence<First, Cs...>,First>: std::true_type {}; 
template<char Match>                            //not found specialization
struct Contains<CharSequence<>,Match>: std::false_type{};

template<int I, typename T, typename U> 
struct MakeSequence;                            //generic
template<int I, bool... Bs, typename U> 
struct MakeSequence<I,BoolSequence<Bs...>, U>:  //not last
    MakeSequence<I-1, BoolSequence<Contains<U,I-1>::value,Bs...>, U>{};
template<bool... Bs, typename U> 
struct MakeSequence<0,BoolSequence<Bs...>,U>{   //last  
    using Type = BoolSequence<Bs...>;
};
template<typename T> struct BoolASCIITable;
template<bool... Bs> struct BoolASCIITable<BoolSequence<Bs...>>{
    /* could be made constexpr but not yet supported by MSVC */
    static bool isDelim(const char c){
        static const bool table[256] = {Bs...};
        return table[static_cast<int>(c)];
    }   
};
using Delims = CharSequence<'.',',',' ',':','\n'>;  //list your custom delimiters here
using Table = BoolASCIITable<typename MakeSequence<256,BoolSequence<>,Delims>::Type>;

With that in place making a getNextToken function is easy:

template<typename T_It>
std::pair<T_It,T_It> getNextToken(T_It begin,T_It end){
    begin = std::find_if(begin,end,std::not1(Table{})); //find first non delim or end
    auto second = std::find_if(begin,end,Table{});      //find first delim or end
    return std::make_pair(begin,second);
}

Using it is also easy:

int main() {
    std::string s{"Some people, excluding those present, have been compile time constants - since puberty."};
    auto it = std::begin(s);
    auto end = std::end(s);
    while(it != std::end(s)){
        auto token = getNextToken(it,end);
        std::cout << std::string(token.first,token.second) << std::endl;
        it = token.second;
    }
    return 0;
}

Here is a live example: http://ideone.com/GKtkLQ

Upvotes: 2

robcsi
robcsi

Reputation: 264

I've been searching for a way to split a string by a separator of any length, so I started writing it from scratch, as existing solutions didn't suit me.

Here is my little algorithm, using only STL:

//use like this
//std::vector<std::wstring> vec = Split<std::wstring> (L"Hello##world##!", L"##");

template <typename valueType>
static std::vector <valueType> Split (valueType text, const valueType& delimiter)
{
    std::vector <valueType> tokens;
    size_t pos = 0;
    valueType token;

    while ((pos = text.find(delimiter)) != valueType::npos) 
    {
        token = text.substr(0, pos);
        tokens.push_back (token);
        text.erase(0, pos + delimiter.length());
    }
    tokens.push_back (text);

    return tokens;
}

It can be used with separator of any length and form, as far as I've tested. Instantiate with either string or wstring type.

All the algorithm does is it searches for the delimiter, gets the part of the string that is up to the delimiter, deletes the delimiter and searches again until it finds it no more.

Hope it helps.

Upvotes: 0

Murphy78
Murphy78

Reputation: 49

/// split a string into multiple sub strings, based on a separator string
/// for example, if separator="::",
///
/// s = "abc" -> "abc"
///
/// s = "abc::def xy::st:" -> "abc", "def xy" and "st:",
///
/// s = "::abc::" -> "abc"
///
/// s = "::" -> NO sub strings found
///
/// s = "" -> NO sub strings found
///
/// then append the sub-strings to the end of the vector v.
/// 
/// the idea comes from the findUrls() function of "Accelerated C++", chapt7,
/// findurls.cpp
///
void split(const string& s, const string& sep, vector<string>& v)
{
    typedef string::const_iterator iter;
    iter b = s.begin(), e = s.end(), i;
    iter sep_b = sep.begin(), sep_e = sep.end();

    // search through s
    while (b != e){
        i = search(b, e, sep_b, sep_e);

        // no more separator found
        if (i == e){
            // it's not an empty string
            if (b != e)
                v.push_back(string(b, e));
            break;
        }
        else if (i == b){
            // the separator is found and right at the beginning
            // in this case, we need to move on and search for the
            // next separator
            b = i + sep.length();
        }
        else{
            // found the separator
            v.push_back(string(b, i));
            b = i;
        }
    }
}

The boost library is good, but they are not always available. Doing this sort of things by hand is also a good brain exercise. Here we just use the std::search() algorithm from the STL, see the above code.

Upvotes: 0

vsoftco
vsoftco

Reputation: 56547

Simple C++ code (standard C++98), accepts multiple delimiters (specified in a std::string), uses only vectors, strings and iterators.

#include <iostream>
#include <vector>
#include <string>
#include <stdexcept> 

std::vector<std::string> 
split(const std::string& str, const std::string& delim){
    std::vector<std::string> result;
    if (str.empty())
        throw std::runtime_error("Can not tokenize an empty string!");
    std::string::const_iterator begin, str_it;
    begin = str_it = str.begin(); 
    do {
        while (delim.find(*str_it) == std::string::npos && str_it != str.end())
            str_it++; // find the position of the first delimiter in str
        std::string token = std::string(begin, str_it); // grab the token
        if (!token.empty()) // empty token only when str starts with a delimiter
            result.push_back(token); // push the token into a vector<string>
        while (delim.find(*str_it) != std::string::npos && str_it != str.end())
            str_it++; // ignore the additional consecutive delimiters
        begin = str_it; // process the remaining tokens
        } while (str_it != str.end());
    return result;
}

int main() {
    std::string test_string = ".this is.a.../.simple;;test;;;END";
    std::string delim = "; ./"; // string containing the delimiters
    std::vector<std::string> tokens = split(test_string, delim);           
    for (std::vector<std::string>::const_iterator it = tokens.begin(); 
        it != tokens.end(); it++)
            std::cout << *it << std::endl;
}

Upvotes: 0

Daren Thomas
Daren Thomas

Reputation: 70314

I thought that was what the >> operator on string streams was for:

string word; sin >> word;

Upvotes: 3

Mr.Ree
Mr.Ree

Reputation: 8418

No offense folks, but for such a simple problem, you are making things way too complicated. There are a lot of reasons to use Boost. But for something this simple, it's like hitting a fly with a 20# sledge.

void
split( vector<string> & theStringVector,  /* Altered/returned value */
       const  string  & theString,
       const  string  & theDelimiter)
{
    UASSERT( theDelimiter.size(), >, 0); // My own ASSERT macro.

    size_t  start = 0, end = 0;

    while ( end != string::npos)
    {
        end = theString.find( theDelimiter, start);

        // If at end, use length=maxLength.  Else use length=end-start.
        theStringVector.push_back( theString.substr( start,
                       (end == string::npos) ? string::npos : end - start));

        // If at end, use start=maxSize.  Else use start=end+delimiter.
        start = (   ( end > (string::npos - theDelimiter.size()) )
                  ?  string::npos  :  end + theDelimiter.size());
    }
}

For example (for Doug's case),

#define SHOW(I,X)   cout << "[" << (I) << "]\t " # X " = \"" << (X) << "\"" << endl

int
main()
{
    vector<string> v;

    split( v, "A:PEP:909:Inventory Item", ":" );

    for (unsigned int i = 0;  i < v.size();   i++)
        SHOW( i, v[i] );
}

And yes, we could have split() return a new vector rather than passing one in. It's trivial to wrap and overload. But depending on what I'm doing, I often find it better to re-use pre-existing objects rather than always creating new ones. (Just as long as I don't forget to empty the vector in between!)

Reference: http://www.cplusplus.com/reference/string/string/.

(I was originally writing a response to Doug's question: C++ Strings Modifying and Extracting based on Separators (closed). But since Martin York closed that question with a pointer over here... I'll just generalize my code.)

Upvotes: 49

Karthik
Karthik

Reputation: 1

This a simple loop to tokenise with only standard library files

#include <iostream.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <conio.h>
class word
    {
     public:
     char w[20];
     word()
      {
        for(int j=0;j<=20;j++)
        {w[j]='\0';
      }
   }



};

void main()
  {
    int i=1,n=0,j=0,k=0,m=1;
    char input[100];
    word ww[100];
    gets(input);

    n=strlen(input);


    for(i=0;i<=m;i++)
      {
        if(context[i]!=' ')
         {
            ww[k].w[j]=context[i];
            j++;

         }
         else
        {
         k++;
         j=0;
         m++;
        }

   }
 }

Upvotes: -4

Darren Smith
Darren Smith

Reputation: 2488

Here's an approach that allows you control over whether empty tokens are included (like strsep) or excluded (like strtok).

#include <string.h> // for strchr and strlen

/*
 * want_empty_tokens==true  : include empty tokens, like strsep()
 * want_empty_tokens==false : exclude empty tokens, like strtok()
 */
std::vector<std::string> tokenize(const char* src,
                                  char delim,
                                  bool want_empty_tokens)
{
  std::vector<std::string> tokens;

  if (src and *src != '\0') // defensive
    while( true )  {
      const char* d = strchr(src, delim);
      size_t len = (d)? d-src : strlen(src);

      if (len or want_empty_tokens)
        tokens.push_back( std::string(src, len) ); // capture token

      if (d) src += len+1; else break;
    }

  return tokens;
}

Upvotes: 2

David919
David919

Reputation: 41

Many overly complicated suggestions here. Try this simple std::string solution:

using namespace std;

string someText = ...

string::size_type tokenOff = 0, sepOff = tokenOff;
while (sepOff != string::npos)
{
    sepOff = someText.find(' ', sepOff);
    string::size_type tokenLen = (sepOff == string::npos) ? sepOff : sepOff++ - tokenOff;
    string token = someText.substr(tokenOff, tokenLen);
    if (!token.empty())
        /* do something with token */;
    tokenOff = sepOff;
}

Upvotes: 4

jochenleidner
jochenleidner

Reputation: 83

boost::tokenizer is your friend, but consider making your code portable with reference to internationalization (i18n) issues by using wstring/wchar_t instead of the legacy string/char types.

#include <iostream>
#include <boost/tokenizer.hpp>
#include <string>

using namespace std;
using namespace boost;

typedef tokenizer<char_separator<wchar_t>,
                  wstring::const_iterator, wstring> Tok;

int main()
{
  wstring s;
  while (getline(wcin, s)) {
    char_separator<wchar_t> sep(L" "); // list of separator characters
    Tok tok(s, sep);
    for (Tok::iterator beg = tok.begin(); beg != tok.end(); ++beg) {
      wcout << *beg << L"\t"; // output (or store in vector)
    }
    wcout << L"\n";
  }
  return 0;
}

Upvotes: 0

Ferruccio
Ferruccio

Reputation: 100638

The Boost tokenizer class can make this sort of thing quite simple:

#include <iostream>
#include <string>
#include <boost/foreach.hpp>
#include <boost/tokenizer.hpp>

using namespace std;
using namespace boost;

int main(int, char**)
{
    string text = "token, test   string";

    char_separator<char> sep(", ");
    tokenizer< char_separator<char> > tokens(text, sep);
    BOOST_FOREACH (const string& t, tokens) {
        cout << t << "." << endl;
    }
}

Updated for C++11:

#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>

using namespace std;
using namespace boost;

int main(int, char**)
{
    string text = "token, test   string";

    char_separator<char> sep(", ");
    tokenizer<char_separator<char>> tokens(text, sep);
    for (const auto& t : tokens) {
        cout << t << "." << endl;
    }
}

Upvotes: 195

Fawix
Fawix

Reputation: 469

You can simply use a regular expression library and solve that using regular expressions.

Use expression (\w+) and the variable in \1 (or $1 depending on the library implementation of regular expressions).

Upvotes: 4

dbr
dbr

Reputation: 169535

pystring is a small library which implements a bunch of Python's string functions, including the split method:

#include <string>
#include <vector>
#include "pystring.h"

std::vector<std::string> chunks;
pystring::split("this string", chunks);

// also can specify a separator
pystring::split("this-string", chunks, "-");

Upvotes: 15

Arash
Arash

Reputation: 39

you can take advantage of boost::make_find_iterator. Something similar to this:

template<typename CH>
inline vector< basic_string<CH> > tokenize(
    const basic_string<CH> &Input,
    const basic_string<CH> &Delimiter,
    bool remove_empty_token
    ) {

    typedef typename basic_string<CH>::const_iterator string_iterator_t;
    typedef boost::find_iterator< string_iterator_t > string_find_iterator_t;

    vector< basic_string<CH> > Result;
    string_iterator_t it = Input.begin();
    string_iterator_t it_end = Input.end();
    for(string_find_iterator_t i = boost::make_find_iterator(Input, boost::first_finder(Delimiter, boost::is_equal()));
        i != string_find_iterator_t();
        ++i) {
        if(remove_empty_token){
            if(it != i->begin())
                Result.push_back(basic_string<CH>(it,i->begin()));
        }
        else
            Result.push_back(basic_string<CH>(it,i->begin()));
        it = i->end();
    }
    if(it != it_end)
        Result.push_back(basic_string<CH>(it,it_end));

    return Result;
}

Upvotes: 1

Related Questions