Reputation: 78538

How do I iterate over the words of a string?

How do I iterate over the words of a string composed of words separated by whitespace?

Note that I'm not interested in C string functions or that kind of character manipulation/access. I prefer elegance over efficiency. My current solution:

#include <iostream>
#include <sstream>
#include <string>

using namespace std;

int main() {
    string s = "Somewhere down the road";
    istringstream iss(s);

    do {
        string subs;
        iss >> subs;
        cout << "Substring: " << subs << endl;
    } while (iss);
}

Upvotes: 3377

Answers (30)

ashvardanian

Reputation: 494

First, there are many whitespace characters. Within ASCII those are: Space (0x20), Tab (0x09), Carriage return (0x0D), Line feed (0x0A), Vertical tab (0x0B), Form feed (0x0C). So if you are tokenizing text, most of the presented solutions won't work. Even if you care only about 0x20, they will be notoriously slow, generally around 500 MB/s on most hardware, compared to 10 GB/s typical single-core memory throughput.

I have 3 alternatives, depending on your taste:

A manual implementation using std::string_view::find_first_of.
Using Eric Niebler's ranges-v3 because std::ranges won't generalize.
Using Ash Vardanian's (my) stringzilla SIMD-accelerated library.

Manual Implementation

#include <string_view>

template <typename Callback>
void split(std::string_view text, std::string_view delimiters, Callback callback) noexcept {
    std::size_t start = 0, end;
    while ((end = text.find_first_of(delimiters, start)) != std::string_view::npos) {
        if (start != end) callback(text.substr(start, end - start));
        start = end + 1;
    }
    if (start < text.size()) callback(text.substr(start));
}

Assuming .find_first_of has a nested loop, you may prefer to replace it with a loop-unrolled variant for performance. The same applies to the next solution.

Ranges V3

The C++20 std::ranges::split can't handle custom predicates. Instead, we can use Eric's original ranges::view::split_when with a lambda, merging consecutive slices similar to other answers.

#include <range/v3/view/split.hpp>

template <typename Callback>
void split(std::string_view text, std::string_view delimiters, Callback callback) noexcept {
    auto is_delimiter = [&](char c) { return delimiters.find(c) != std::string_view::npos; };
    for (auto range : text | ranges::views::split(is_delimiter)) {
        if (!range.empty()) {
            auto token = std::string_view(&*range.begin(), ranges::distance(range));
            callback(token);
        }
    }
}

StringZilla

If you often deal with strings, need higher compatibility than STL, and prefer your code to be fast, here is my alternative:

#include <stringzilla/stringzilla.hpp>

template <typename Callback>
void split(std::string_view text, std::string_view delimiters, Callback callback) noexcept {
    namespace sz = ashvardanian::stringzilla;
    for (sz::string_view token : sz::string_view(text).split(sz::char_set(delimiters))) {
        if (!token.empty()) callback(token);
    }
}

It should be several times faster at runtime than any other C/C++ library that does not use AVX-512 on x86 or Neon on Arm.

Upvotes: 2

Porsche9II

Reputation: 661

Using std::string_view and Eric Niebler's range-v3 library:

https://wandbox.org/permlink/kW5lwRCL1pxjp2pW

#include <iostream>
#include <string>
#include <string_view>
#include "range/v3/view.hpp"
#include "range/v3/algorithm.hpp"

int main() {
    std::string s = "Somewhere down the range v3 library";
    ranges::for_each(s  
        |   ranges::view::split(' ')
        |   ranges::view::transform([](auto &&sub) {
                return std::string_view(&*sub.begin(), ranges::distance(sub));
            }),
        [](auto s) {std::cout << "Substring: " << s << "\n";}
    );
}

By using a range for loop instead of ranges::for_each algorithm:

#include <iostream>
#include <string>
#include <string_view>
#include "range/v3/view.hpp"

int main()
{
    std::string str = "Somewhere down the range v3 library";
    for (auto s : str | ranges::view::split(' ')
                      | ranges::view::transform([](auto&& sub) { return std::string_view(&*sub.begin(), ranges::distance(sub)); }
                      ))
    {
        std::cout << "Substring: " << s << "\n";
    }
}

As sehe pointed out in the comments, the preferred std::ranges solution should be:

#include <iomanip>
#include <iostream>
#include <ranges>
#include <string_view>
using namespace std::literals;

int main() {
    static constexpr auto str = "Somewhere down the c++20 standard library"sv;
    for (auto [b, e] : str | std::ranges::views::split(' '))
        std::cout << "Substring: " << quoted(std::string_view(b, e)) << "\n";
}

Even compiling with -std=c++20 should do the trick.

Upvotes: 23

dk123

Reputation: 19700

Here's a simple solution that uses only the standard regex library

#include <regex>
#include <string>
#include <vector>

std::vector<std::string> Tokenize( const string str, const std::regex regex )
{
    using namespace std;

    std::vector<string> result;

    sregex_token_iterator it( str.begin(), str.end(), regex, -1 );
    sregex_token_iterator reg_end;

    for ( ; it != reg_end; ++it ) {
        if ( !it->str().empty() ) //token could be empty:check
            result.emplace_back( it->str() );
    }

    return result;
}

The regex argument allows checking for multiple arguments (spaces, commas, etc.)

I usually only check to split on spaces and commas, so I also have this default function:

std::vector<std::string> TokenizeDefault( const string str )
{
    using namespace std;

    regex re( "[\\s,]+" );

    return Tokenize( str, re );
}

The "[\\s,]+" checks for spaces (\\s) and commas (,).

Note, if you want to split wstring instead of string,

change all std::regex to std::wregex
change all sregex_token_iterator to wsregex_token_iterator

Note, you might also want to take the string argument by reference, depending on your compiler.

Upvotes: 37

ahcox

Reputation: 9970

Use a Spliterator (Split Iterator)

It is possible to iterate over the words of the input string without doing any heap allocations building intermediate data structures like a vector of substrings. This Spliterator class returns a string_view for each substring, avoiding allocations.

struct Spliterator
{
    Spliterator(string_view sentence) : sentence_(sentence), word_end_(sentence.begin())
    {
        next();
    }
    operator string_view() const { return {word_begin_, word_end_};}
    void next()
    {
        word_begin_ = word_end_;
        while(word_begin_ != sentence_.end() && std::isspace(*word_begin_)) ++word_begin_;
        word_end_ = word_begin_;
        while(word_end_ != sentence_.end() && (!std::isspace(*word_end_))) ++word_end_;
    }
    string_view sentence_;
    string_view::iterator word_begin_;
    string_view::iterator word_end_;
};

Usage:

void process_word(string_view word); // Do whatever you want with the words.

void process_words(string_view sentence)
{
    Spliterator spliterator {sentence};
    string_view word;
    while((word = spliterator).length() > 0)
    {
        process_word(word);
        spliterator.next();
    }
}

This idea can be generalised to take a user-specified set of splitting characters.

Upvotes: 0

Marius

Reputation: 3511

An efficient, small, and elegant solution using a template function:

template <class ContainerT>
void split(const std::string& str, ContainerT& tokens,
           const std::string& delimiters = " ", bool trimEmpty = false)
{
   std::string::size_type pos, lastPos = 0, length = str.length();
   
   using value_type = typename ContainerT::value_type;
   using size_type = typename ContainerT::size_type;
   
   while (lastPos < length + 1)
   {
      pos = str.find_first_of(delimiters, lastPos);
      if (pos == std::string::npos)
         pos = length;

      if (pos != lastPos || !trimEmpty)
         tokens.emplace_back(value_type(str.data() + lastPos,
               (size_type)pos - lastPos));

      lastPos = pos + 1;
   }
}

I usually choose to use std::vector<std::string> types as my second parameter (ContainerT)... but list<...> may sometimes be preferred over vector<...>.

It also allows you to specify whether to trim empty tokens from the results via a last optional parameter.

All it requires is std::string included via <string>. It does not use streams or the boost library explicitly but will be able to accept some of these types.

Also since C++-17 you can use std::vector<std::string_view> which is much faster and more memory-efficient than using std::string. Here is a revised version which also supports the container as a return type:

#include <vector>
#include <string_view>
#include <utility>
    
template < typename StringT,
           typename DelimiterT = char,
           typename ContainerT = std::vector<std::string_view> >
ContainerT split(StringT const& str, DelimiterT const& delimiters = ' ', bool trimEmpty = true, ContainerT&& tokens = {})
{
    typename StringT::size_type pos, lastPos = 0, length = str.length();

    while (lastPos < length + 1)
    {
        pos = str.find_first_of(delimiters, lastPos);
        if (pos == StringT::npos)
            pos = length;

      if (pos != lastPos || !trimEmpty)
            tokens.emplace_back(str.data() + lastPos, pos - lastPos);

        lastPos = pos + 1;
    }

    return std::forward<ContainerT>(tokens);
}

Care has been taken not to make any unneeded copies.

This will allow for either:

for (auto const& line : split(str, '\n'))

Or:

auto& lines = split(str, '\n');

Both returning the default template container type of std::vector<std::string_view>.

To get a specific container type back, or to pass an existing container, use the tokens input parameter with either a typed initial container or an existing container variable:

auto& lines = split(str, '\n', false, std::vector<std::string>());

Or:

std::vector<std::string> lines;
split(str, '\n', false, lines);

Upvotes: 202

J. Willus

Reputation: 577

C++20 finally blesses us with a split function. Or rather, a range adapter. Godbolt link.

#include <iostream>
#include <ranges>
#include <string_view>

namespace ranges = std::ranges;
namespace views = std::views;

using str = std::string_view;

auto view =
    "Multiple words"
    | views::split(' ')
    | views::transform([](auto &&r) -> str {
        return str(r.begin(), r.end());
    });

auto main() -> int {
    for (str &&sv : view) {
        std::cout << sv << '\n';
    }
}

Upvotes: 29

Kelly Elton

Reputation: 4427

The following is a much better way to do this. It can take any character and doesn't split lines unless you want. No special libraries are needed (well, besides std, but who really considers that an extra library) No pointers or references are needed, and it's static. Just simple plain C++.

#pragma once
#include <vector>
#include <sstream>
using namespace std;
class Helpers
{
    public:
        static vector<string> split(string s, char delim)
        {
            stringstream temp (stringstream::in | stringstream::out);
            vector<string> elems(0);
            if (s.size() == 0 || delim == 0)
                return elems;
            for(char c : s)
            {
                if(c == delim)
                {
                    elems.push_back(temp.str());
                    temp = stringstream(stringstream::in | stringstream::out);
                }
                else
                    temp << c;
            }
            if (temp.str().size() > 0)
                elems.push_back(temp.str());
                return elems;
            }

        //Splits string s with a list of delimiters in delims (it's just a list, like if we wanted to
        //split at the following letters, a, b, c we would make delims="abc".
        static vector<string> split(string s, string delims)
        {
            stringstream temp (stringstream::in | stringstream::out);
            vector<string> elems(0);
            bool found;
            if(s.size() == 0 || delims.size() == 0)
                return elems;
            for(char c : s)
            {
                found = false;
                for(char d : delims)
                {
                    if (c == d)
                    {
                        elems.push_back(temp.str());
                        temp = stringstream(stringstream::in | stringstream::out);
                        found = true;
                        break;
                    }
                }
                if(!found)
                    temp << c;
            }
            if(temp.str().size() > 0)
                elems.push_back(temp.str());
            return elems;
        }
};

Upvotes: 6

Steve Dell

Reputation: 605

I made this because I needed an easy way to split strings and C-based strings. Hopefully someone else can find it useful as well. Also, it doesn't rely on tokens, and you can use fields as delimiters, which is another key I needed.

I'm sure there are improvements that can be made to even further improve its elegance, and please do by all means.

StringSplitter.hpp:

#include <vector>
#include <iostream>
#include <string.h>

using namespace std;

class StringSplit
{
private:
    void copy_fragment(char*, char*, char*);
    void copy_fragment(char*, char*, char);
    bool match_fragment(char*, char*, int);
    int untilnextdelim(char*, char);
    int untilnextdelim(char*, char*);
    void assimilate(char*, char);
    void assimilate(char*, char*);
    bool string_contains(char*, char*);
    long calc_string_size(char*);
    void copy_string(char*, char*);

public:
    vector<char*> split_cstr(char);
    vector<char*> split_cstr(char*);
    vector<string> split_string(char);
    vector<string> split_string(char*);
    char* String;
    bool do_string;
    bool keep_empty;
    vector<char*> Container;
    vector<string> ContainerS;

    StringSplit(char * in)
    {
        String = in;
    }

    StringSplit(string in)
    {
        size_t len = calc_string_size((char*)in.c_str());
        String = new char[len + 1];
        memset(String, 0, len + 1);
        copy_string(String, (char*)in.c_str());
        do_string = true;
    }

    ~StringSplit()
    {
        for (int i = 0; i < Container.size(); i++)
        {
            if (Container[i] != NULL)
            {
                delete[] Container[i];
            }
        }
        if (do_string)
        {
            delete[] String;
        }
    }
};

StringSplitter.cpp:

#include <string.h>
#include <iostream>
#include <vector>
#include "StringSplit.hpp"

using namespace std;

void StringSplit::assimilate(char*src, char delim)
{
    int until = untilnextdelim(src, delim);
    if (until > 0)
    {
        char * temp = new char[until + 1];
        memset(temp, 0, until + 1);
        copy_fragment(temp, src, delim);
        if (keep_empty || *temp != 0)
        {
            if (!do_string)
            {
                Container.push_back(temp);
            }
            else
            {
                string x = temp;
                ContainerS.push_back(x);
            }

        }
        else
        {
            delete[] temp;
        }
    }
}

void StringSplit::assimilate(char*src, char* delim)
{
    int until = untilnextdelim(src, delim);
    if (until > 0)
    {
        char * temp = new char[until + 1];
        memset(temp, 0, until + 1);
        copy_fragment(temp, src, delim);
        if (keep_empty || *temp != 0)
        {
            if (!do_string)
            {
                Container.push_back(temp);
            }
            else
            {
                string x = temp;
                ContainerS.push_back(x);
            }
        }
        else
        {
            delete[] temp;
        }
    }
}

long StringSplit::calc_string_size(char* _in)
{
    long i = 0;
    while (*_in++)
    {
        i++;
    }
    return i;
}

bool StringSplit::string_contains(char* haystack, char* needle)
{
    size_t len = calc_string_size(needle);
    size_t lenh = calc_string_size(haystack);
    while (lenh--)
    {
        if (match_fragment(haystack + lenh, needle, len))
        {
            return true;
        }
    }
    return false;
}

bool StringSplit::match_fragment(char* _src, char* cmp, int len)
{
    while (len--)
    {
        if (*(_src + len) != *(cmp + len))
        {
            return false;
        }
    }
    return true;
}

int StringSplit::untilnextdelim(char* _in, char delim)
{
    size_t len = calc_string_size(_in);
    if (*_in == delim)
    {
        _in += 1;
        return len - 1;
    }

    int c = 0;
    while (*(_in + c) != delim && c < len)
    {
        c++;
    }

    return c;
}

int StringSplit::untilnextdelim(char* _in, char* delim)
{
    int s = calc_string_size(delim);
    int c = 1 + s;

    if (!string_contains(_in, delim))
    {
        return calc_string_size(_in);
    }
    else if (match_fragment(_in, delim, s))
    {
        _in += s;
        return calc_string_size(_in);
    }

    while (!match_fragment(_in + c, delim, s))
    {
        c++;
    }

    return c;
}

void StringSplit::copy_fragment(char* dest, char* src, char delim)
{
    if (*src == delim)
    {
        src++;
    }
        
    int c = 0;
    while (*(src + c) != delim && *(src + c))
    {
        *(dest + c) = *(src + c);
        c++;
    }
    *(dest + c) = 0;
}

void StringSplit::copy_string(char* dest, char* src)
{
    int i = 0;
    while (*(src + i))
    {
        *(dest + i) = *(src + i);
        i++;
    }
}

void StringSplit::copy_fragment(char* dest, char* src, char* delim)
{
    size_t len = calc_string_size(delim);
    size_t lens = calc_string_size(src);
    
    if (match_fragment(src, delim, len))
    {
        src += len;
        lens -= len;
    }
    
    int c = 0;
    while (!match_fragment(src + c, delim, len) && (c < lens))
    {
        *(dest + c) = *(src + c);
        c++;
    }
    *(dest + c) = 0;
}

vector<char*> StringSplit::split_cstr(char Delimiter)
{
    int i = 0;
    while (*String)
    {
        if (*String != Delimiter && i == 0)
        {
            assimilate(String, Delimiter);
        }
        if (*String == Delimiter)
        {
            assimilate(String, Delimiter);
        }
        i++;
        String++;
    }

    String -= i;
    delete[] String;

    return Container;
}

vector<string> StringSplit::split_string(char Delimiter)
{
    do_string = true;
    
    int i = 0;
    while (*String)
    {
        if (*String != Delimiter && i == 0)
        {
            assimilate(String, Delimiter);
        }
        if (*String == Delimiter)
        {
            assimilate(String, Delimiter);
        }
        i++;
        String++;
    }

    String -= i;
    delete[] String;

    return ContainerS;
}

vector<char*> StringSplit::split_cstr(char* Delimiter)
{
    int i = 0;
    size_t LenDelim = calc_string_size(Delimiter);

    while(*String)
    {
        if (!match_fragment(String, Delimiter, LenDelim) && i == 0)
        {
            assimilate(String, Delimiter);
        }
        if (match_fragment(String, Delimiter, LenDelim))
        {
            assimilate(String,Delimiter);
        }
        i++;
        String++;
    }

    String -= i;
    delete[] String;

    return Container;
}

vector<string> StringSplit::split_string(char* Delimiter)
{
    do_string = true;
    int i = 0;
    size_t LenDelim = calc_string_size(Delimiter);

    while (*String)
    {
        if (!match_fragment(String, Delimiter, LenDelim) && i == 0)
        {
            assimilate(String, Delimiter);
        }
        if (match_fragment(String, Delimiter, LenDelim))
        {
            assimilate(String, Delimiter);
        }
        i++;
        String++;
    }

    String -= i;
    delete[] String;

    return ContainerS;
}

Examples:

int main(int argc, char*argv[])
{
    StringSplit ss = "This:CUT:is:CUT:an:CUT:example:CUT:cstring";
    vector<char*> Split = ss.split_cstr(":CUT:");

    for (int i = 0; i < Split.size(); i++)
    {
        cout << Split[i] << endl;
    }

    return 0;
}

Will output:

This
is
an
example
cstring

int main(int argc, char*argv[])
{
    StringSplit ss = "This:is:an:example:cstring";
    vector<char*> Split = ss.split_cstr(':');

    for (int i = 0; i < Split.size(); i++)
    {
        cout << Split[i] << endl;
    }

    return 0;
}

int main(int argc, char*argv[])
{
    string mystring = "This[SPLIT]is[SPLIT]an[SPLIT]example[SPLIT]string";
    StringSplit ss = mystring;
    vector<string> Split = ss.split_string("[SPLIT]");

    for (int i = 0; i < Split.size(); i++)
    {
        cout << Split[i] << endl;
    }

    return 0;
}

int main(int argc, char*argv[])
{
    string mystring = "This|is|an|example|string";
    StringSplit ss = mystring;
    vector<string> Split = ss.split_string('|');

    for (int i = 0; i < Split.size(); i++)
    {
        cout << Split[i] << endl;
    }

    return 0;
}

To keep empty entries (by default empties will be excluded):

StringSplit ss = mystring;
ss.keep_empty = true;
vector<string> Split = ss.split_string(":DELIM:");

The goal was to make it similar to C#'s Split() method where splitting a string is as easy as:

String[] Split = 
    "Hey:cut:what's:cut:your:cut:name?".Split(new[]{":cut:"}, StringSplitOptions.None);

foreach(String X in Split)
{
    Console.Write(X);
}

I hope someone else can find this as useful as I do.

Upvotes: 13

Pavan Chandaka

Reputation: 12801

Some C++20 compilers and most of the C++23 compilers (ranges and string_view)

for (auto word : std::views::split("Somewhere down the road", ' '))
        std::cout << std::string_view{ word.begin(), word.end() } << std::endl;

Upvotes: 2

Evan Teran

Reputation: 90513

I use this to split string by a delimiter. The first puts the results in a pre-constructed vector, the second returns a new vector.

#include <string>
#include <sstream>
#include <vector>
#include <iterator>

template <typename Out>
void split(const std::string &s, char delim, Out result) {
    std::istringstream iss(s);
    std::string item;
    while (std::getline(iss, item, delim)) {
        *result++ = item;
    }
}

std::vector<std::string> split(const std::string &s, char delim) {
    std::vector<std::string> elems;
    split(s, delim, std::back_inserter(elems));
    return elems;
}

Note that this solution does not skip empty tokens, so the following will find 4 items, one of which is empty:

std::vector<std::string> x = split("one:two::three", ':');

Upvotes: 2590

Yakk - Adam Nevraumont

Reputation: 275820

Yet another way -- continuation passing style, zero allocation, function based delimiting.

 void split( auto&& data, auto&& splitter, auto&& operation ) {
   using std::begin; using std::end;
   auto prev = begin(data);
   while (prev != end(data) ) {
     auto&&[prev,next] = splitter( prev, end(data) );
     operation(prev,next);
     prev = next;
   }
 }

Now we can write specific split functions based off this.

 auto anyOfSplitter(auto delimiters) {
   return [delimiters](auto begin, auto end) {
     while( begin != end && 0 == std::string_view(begin, end).find_first_of(delimiters) ) {
       ++begin;
     }
     auto view = std::string_view(begin, end);
     auto next = view.find_first_of(delimiters);
     if (next != view.npos)
       return std::make_pair( begin, begin + next );
     else
       return std::make_pair( begin, end );
   };
 }

we can now produce a traditional std string split like this:

 template<class C>
 auto traditional_any_of_split( std::string_view<C> str, std::string_view<C> delim ) {
   std::vector<std::basic_string<C>> retval;
   split( str, anyOfSplitter(delim), [&](auto s, auto f) {
     retval.emplace_back(s,f);
   });
   return retval;
 }

or we can use find instead

 auto findSplitter(auto delimiter) {
   return [delimiter](auto begin, auto end) {
     while( begin != end && 0 == std::string_view(begin, end).find(delimiter) ) {
       begin += delimiter.size();
     }
     auto view = std::string_view(begin, end);
     auto next = view.find(delimiter);
     if (next != view.npos)
       return std::make_pair( begin, begin + next );
     else
       return std::make_pair( begin, end );
   };
 }

 template<class C>
 auto traditional_find_split( std::string_view<C> str, std::string_view<C> delim ) {
   std::vector<std::basic_string<C>> retval;
   split( str, findSplitter(delim), [&](auto s, auto f) {
     retval.emplace_back(s,f);
   });
   return retval;
 }

by replacing the splitter portion.

Both of these allocate a buffer of return values. We can swap the return values to string views at the cost of manually managing lifetime.

We can also take a continuation that will get passed the string views one at a time, avoiding even allocating the vector of views.

This can be extended with an abort option, so that we can abort after reading a few prefix strings.

Upvotes: 1

Sam B

Reputation: 27618

I cannot believe how overly complicated most of these answers were. Why didnt someone suggest something as simple as this?

#include <iostream>
#include <sstream>

std::string input = "This is a sentence to read";
std::istringstream ss(input);
std::string token;

while(std::getline(ss, token, ' ')) {
    std::cout << token << endl;
}

Upvotes: 11

AlQuemist

Reputation: 1314

A minimal solution is a function which takes as input a std::string and a set of delimiter characters (as a std::string), and returns a std::vector of std::strings.

#include <string>
#include <vector>

std::vector<std::string>
tokenize(const std::string& str, const std::string& delimiters)
{
  using ssize_t = std::string::size_type;
  const ssize_t str_ln = str.length();
  ssize_t last_pos = 0;

  // container for the extracted tokens
  std::vector<std::string> tokens;

  while (last_pos < str_ln) {
      // find the position of the next delimiter
      ssize_t pos = str.find_first_of(delimiters, last_pos);

      // if no delimiters found, set the position to the length of string
      if (pos == std::string::npos)
         pos = str_ln;

      // if the substring is nonempty, store it in the container
      if (pos != last_pos)
         tokens.emplace_back(str.substr(last_pos, pos - last_pos));

      // scan past the previous substring
      last_pos = pos + 1;
  }

  return tokens;
}

A usage example:

#include <iostream>

int main()
{
    std::string input_str = "one + two * (three - four)!!---! ";
    const char* delimiters = "! +- (*)";
    std::vector<std::string> tokens = tokenize(input_str, delimiters);

    std::cout << "input = '" << input_str << "'\n"
              << "delimiters = '" << delimiters << "'\n"
              << "nr of tokens found = " << tokens.size() << std::endl;
    for (const std::string& tk : tokens) {
        std::cout << "token = '" << tk << "'\n";
    }

  return 0;
}

Upvotes: 1

Ferruccio

Reputation: 100718

This is similar to Stack Overflow question How do I tokenize a string in C++?. Requires Boost external library

#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>

using namespace std;
using namespace boost;

int main(int argc, char** argv)
{
    string text = "token  test\tstring";

    char_separator<char> sep(" \t");
    tokenizer<char_separator<char>> tokens(text, sep);
    for (const string& t : tokens)
    {
        cout << t << "." << endl;
    }
}

Upvotes: 89

user10873315

Reputation:

There's a way easier method to do this!!

#include <vector>
#include <string>
std::vector<std::string> splitby(std::string string, char splitter) {
    int splits = 0;
    std::vector<std::string> result = {};
    std::string locresult = "";
    for (unsigned int i = 0; i < string.size(); i++) {
        if ((char)string.at(i) != splitter) {
            locresult += string.at(i);
        }
        else {
            result.push_back(locresult);
            locresult = "";
        }
    }
    if (splits == 0) {
        result.push_back(locresult);
    }
    return result;
}

void printvector(std::vector<std::string> v) {
    std::cout << '{';
    for (unsigned int i = 0; i < v.size(); i++) {
        if (i < v.size() - 1) {
            std::cout << '"' << v.at(i) << "\",";
        }
        else {
            std::cout << '"' << v.at(i) << "\"";
        }
    }
    std::cout << "}\n";
}

Upvotes: 0

Kaznov

Reputation: 1163

Although there was some answer providing C++20 solution, since it was posted there were some changes made and applied to C++20 as Defect Reports. Because of that the solution is a little bit shorter and nicer:

#include <iostream>
#include <ranges>
#include <string_view>

namespace views = std::views;
using str = std::string_view;

constexpr str text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit.";

auto splitByWords(str input) {
    return input
    | views::split(' ')
    | views::transform([](auto &&r) -> str {
        return {r.begin(), r.end()};
    });
}

auto main() -> int {
    for (str &&word : splitByWords(text)) {
        std::cout << word << '\n';
    }
}

As of today it is still available only on the trunk branch of GCC (Godbolt link). It is based on two changes: P1391 iterator constructor for std::string_view and P2210 DR fixing std::views::split to preserve range type.

In C++23 there won't be any transform boilerplate needed, since P1989 adds a range constructor to std::string_view:

#include <iostream>
#include <ranges>
#include <string_view>

namespace views = std::views;

constexpr std::string_view text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit.";

auto main() -> int {
    for (std::string_view&& word : text | views::split(' ')) {
        std::cout << word << '\n';
    }
}

(Godbolt link)

Upvotes: 9

Nur Bijoy

Reputation: 72

Everyone answered for predefined string input. I think this answer will help someone for scanned input.

I used tokens vector for holding string tokens. It's optional.

#include <bits/stdc++.h>

using namespace std ;
int main()
{
    string str, token ;
    getline(cin, str) ; // get the string as input
    istringstream ss(str); // insert the string into tokenizer

    vector<string> tokens; // vector tokens holds the tokens

    while (ss >> token) tokens.push_back(token); // splits the tokens
    for(auto x : tokens) cout << x << endl ; // prints the tokens

    return 0;
}

sample input:

port city international university

sample output:

port
city
international
university

Note that by default this will work for only space as the delimiter. you can use custom delimiter. For that, you have customized the code. let the delimiter be ','. so use

char delimiter = ',' ;
while(getline(ss, token, delimiter)) tokens.push_back(token) ;

instead of

while (ss >> token) tokens.push_back(token);

Upvotes: 6

balki

Reputation: 27684

C++17 version without any memory allocation (except may be for std::function)

void iter_words(const std::string_view& input, const std::function<void(std::string_view)>& process_word) {

    auto itr = input.begin();

    auto consume_whitespace = [&]() {
        for(; itr != input.end(); ++itr) {
            if(!isspace(*itr))
                return;
        }
    };

    auto consume_letters = [&]() {
        for(; itr != input.end(); ++itr) {
            if(isspace(*itr))
                return;
        }
    };

    while(true) {
        consume_whitespace();
        if(itr == input.end())
            return;
        auto word_start = itr - input.begin();
        consume_letters();
        auto word_end = itr - input.begin();
        process_word(input.substr(word_start, word_end - word_start));
    }
}

int main() {
    iter_words("foo bar", [](std::string_view sv) {
        std::cout << "Got word: " <<  sv << '\n';
    });
    return 0;
}

Upvotes: 2

Zunino

Reputation: 2034

For what it's worth, here's another way to extract tokens from an input string, relying only on standard library facilities. It's an example of the power and elegance behind the design of the STL.

#include <iostream>
#include <string>
#include <sstream>
#include <algorithm>
#include <iterator>

int main() {
    using namespace std;
    string sentence = "And I feel fine...";
    istringstream iss(sentence);
    copy(istream_iterator<string>(iss),
         istream_iterator<string>(),
         ostream_iterator<string>(cout, "\n"));
}

Instead of copying the extracted tokens to an output stream, one could insert them into a container, using the same generic copy algorithm.

vector<string> tokens;
copy(istream_iterator<string>(iss),
     istream_iterator<string>(),
     back_inserter(tokens));

... or create the vector directly:

vector<string> tokens{istream_iterator<string>{iss},
                      istream_iterator<string>{}};

Upvotes: 1512

okovko

Reputation: 1911

I have a very different approach from the other solutions that offers a lot of value in ways that the other solutions are variously lacking, but of course also has its own down sides. Here is the working implementation, with the example of putting <tag></tag> around words.

For a start, this problem can be solved with one loop, no additional memory, and by considering merely four logical cases. Conceptually, we're interested in boundaries. Our code should reflect that: let's iterate through the string and look at two characters at a time, bearing in mind that we have special cases at the start and end of the string.

The downside is that we have to write the implementation, which is somewhat verbose, but mostly convenient boilerplate.

The upside is that we wrote the implementation, so it is very easy to customize it to specific needs, such as distinguishing left and write word boundaries, using any set of delimiters, or handling other cases such as non-boundary or erroneous positions.

using namespace std;

#include <iostream>
#include <string>

#include <cctype>

typedef enum boundary_type_e {
    E_BOUNDARY_TYPE_ERROR = -1,
    E_BOUNDARY_TYPE_NONE,
    E_BOUNDARY_TYPE_LEFT,
    E_BOUNDARY_TYPE_RIGHT,
} boundary_type_t;

typedef struct boundary_s {
    boundary_type_t type;
    int pos;
} boundary_t;

bool is_delim_char(int c) {
    return isspace(c); // also compare against any other chars you want to use as delimiters
}

bool is_word_char(int c) {
    return ' ' <= c && c <= '~' && !is_delim_char(c);
}

boundary_t maybe_word_boundary(string str, int pos) {
    int len = str.length();
    if (pos < 0 || pos >= len) {
        return (boundary_t){.type = E_BOUNDARY_TYPE_ERROR};
    } else {
        if (pos == 0 && is_word_char(str[pos])) {
            // if the first character is word-y, we have a left boundary at the beginning
            return (boundary_t){.type = E_BOUNDARY_TYPE_LEFT, .pos = pos};
        } else if (pos == len - 1 && is_word_char(str[pos])) {
            // if the last character is word-y, we have a right boundary left of the null terminator
            return (boundary_t){.type = E_BOUNDARY_TYPE_RIGHT, .pos = pos + 1};
        } else if (!is_word_char(str[pos]) && is_word_char(str[pos + 1])) {
            // if we have a delimiter followed by a word char, we have a left boundary left of the word char
            return (boundary_t){.type = E_BOUNDARY_TYPE_LEFT, .pos = pos + 1};
        } else if (is_word_char(str[pos]) && !is_word_char(str[pos + 1])) {
            // if we have a word char followed by a delimiter, we have a right boundary right of the word char
            return (boundary_t){.type = E_BOUNDARY_TYPE_RIGHT, .pos = pos + 1};
        }
        return (boundary_t){.type = E_BOUNDARY_TYPE_NONE};
    }
}

int main() {
    string str;
    getline(cin, str);

    int len = str.length();
    for (int i = 0; i < len; i++) {
        boundary_t boundary = maybe_word_boundary(str, i);
        if (boundary.type == E_BOUNDARY_TYPE_LEFT) {
            // whatever
        } else if (boundary.type == E_BOUNDARY_TYPE_RIGHT) {
            // whatever
        }
    }
}

As you can see, the code is very simple to understand and fine tune, and the actual usage of the code is very short and simple. Using C++ should not stop us from writing the simplest and most readily customized code possible, even if that means not using the STL. I would think this is an instance of what Linus Torvalds might call "taste", since we have eliminated all the logic we don't need while writing in a style that naturally allows more cases to be handled when and if the need to handle them arises.

What could improve this code might be the use of enum class, accepting a function pointer to is_word_char in maybe_word_boundary instead of invoking is_word_char directly, and passing a lambda.

Upvotes: 1

kev

Reputation: 161914

#include <vector>
#include <string>
#include <sstream>

int main()
{
    std::string str("Split me by whitespaces");
    std::string buf;                 // Have a buffer string
    std::stringstream ss(str);       // Insert the string into a stream

    std::vector<std::string> tokens; // Create vector to hold our words

    while (ss >> buf)
        tokens.push_back(buf);

    return 0;
}

Upvotes: 414

KTC

Reputation: 9003

Using std::stringstream as you have works perfectly fine, and do exactly what you wanted. If you're just looking for different way of doing things though, you can use std::find()/std::find_first_of() and std::string::substr().

Here's an example:

#include <iostream>
#include <string>

int main()
{
    std::string s("Somewhere down the road");
    std::string::size_type prev_pos = 0, pos = 0;

    while( (pos = s.find(' ', pos)) != std::string::npos )
    {
        std::string substring( s.substr(prev_pos, pos-prev_pos) );

        std::cout << substring << '\n';

        prev_pos = ++pos;
    }

    std::string substring( s.substr(prev_pos, pos-prev_pos) ); // Last word
    std::cout << substring << '\n';

    return 0;
}

Upvotes: 33

gnomed

Reputation: 5565

This is my favorite way to iterate through a string. You can do whatever you want per word.

string line = "a line of text to iterate through";
string word;

istringstream iss(line, istringstream::in);

while( iss >> word )     
{
    // Do something on `word` here...
}

Upvotes: 142

user19302

Reputation:

The STL does not have such a method available already.

However, you can either use C's strtok() function by using the std::string::c_str() member, or you can write your own. Here is a code sample I found after a quick Google search ("STL string split"):

void Tokenize(const string& str,
              vector<string>& tokens,
              const string& delimiters = " ")
{
    // Skip delimiters at beginning.
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);
    // Find first "non-delimiter".
    string::size_type pos     = str.find_first_of(delimiters, lastPos);

    while (string::npos != pos || string::npos != lastPos)
    {
        // Found a token, add it to the vector.
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters.  Note the "not_of"
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find next "non-delimiter"
        pos = str.find_first_of(delimiters, lastPos);
    }
}

Taken from: http://oopweb.com/CPP/Documents/CPPHOWTO/Volume/C++Programming-HOWTO-7.html

If you have questions about the code sample, leave a comment and I will explain.

And just because it does not implement a typedef called iterator or overload the << operator does not mean it is bad code. I use C functions quite frequently. For example, printf and scanf both are faster than std::cin and std::cout (significantly), the fopen syntax is a lot more friendly for binary types, and they also tend to produce smaller EXEs.

Don't get sold on this "Elegance over performance" deal.

Upvotes: 60

NL628

Reputation: 438

This answer takes the string and puts it into a vector of strings. It uses the boost library.

#include <boost/algorithm/string.hpp>
std::vector<std::string> strs;
boost::split(strs, "string to split", boost::is_any_of("\t "));

Upvotes: 13

Oleg

Reputation: 506

#include <iostream>
#include <string>
#include <deque>

std::deque<std::string> split(
    const std::string& line, 
    std::string::value_type delimiter,
    bool skipEmpty = false
) {
    std::deque<std::string> parts{};

    if (!skipEmpty && !line.empty() && delimiter == line.at(0)) {
        parts.push_back({});
    }

    for (const std::string::value_type& c : line) {
        if (
            (
                c == delimiter 
                &&
                (skipEmpty ? (!parts.empty() && !parts.back().empty()) : true)
            )
            ||
            (c != delimiter && parts.empty())
        ) {
            parts.push_back({});
        }

        if (c != delimiter) {
            parts.back().push_back(c);
        }
    }

    if (skipEmpty && !parts.empty() && parts.back().empty()) {
        parts.pop_back();
    }

    return parts;
}

void test(const std::string& line) {
    std::cout << line << std::endl;

    std::cout << "skipEmpty=0 |";
    for (const std::string& part : split(line, ':')) {
        std::cout << part << '|';
    }
    std::cout << std::endl;

    std::cout << "skipEmpty=1 |";
    for (const std::string& part : split(line, ':', true)) {
        std::cout << part << '|';
    }
    std::cout << std::endl;

    std::cout << std::endl;
}

int main() {
    test("foo:bar:::baz");
    test("");
    test("foo");
    test(":");
    test("::");
    test(":foo");
    test("::foo");
    test(":foo:");
    test(":foo::");

    return 0;
}

Output:

foo:bar:::baz
skipEmpty=0 |foo|bar|||baz|
skipEmpty=1 |foo|bar|baz|


skipEmpty=0 |
skipEmpty=1 |

foo
skipEmpty=0 |foo|
skipEmpty=1 |foo|

:
skipEmpty=0 |||
skipEmpty=1 |

::
skipEmpty=0 ||||
skipEmpty=1 |

:foo
skipEmpty=0 ||foo|
skipEmpty=1 |foo|

::foo
skipEmpty=0 |||foo|
skipEmpty=1 |foo|

:foo:
skipEmpty=0 ||foo||
skipEmpty=1 |foo|

:foo::
skipEmpty=0 ||foo|||
skipEmpty=1 |foo|

Upvotes: 0

Joakim L. Christiansen

Reputation: 1429

Not that we need more answers, but this is what I came up with after being inspired by Evan Teran.

std::vector <std::string> split(const string &input, auto delimiter, bool skipEmpty=true) {
  /*
  Splits a string at each delimiter and returns these strings as a string vector.
  If the delimiter is not found then nothing is returned.
  If skipEmpty is true then strings between delimiters that are 0 in length will be skipped.
  */
  bool delimiterFound = false;
  int pos=0, pPos=0;
  std::vector <std::string> result;
  while (true) {
    pos = input.find(delimiter,pPos);
    if (pos != std::string::npos) {
      if (skipEmpty==false or pos-pPos > 0) // if empty values are to be kept or not
        result.push_back(input.substr(pPos,pos-pPos));
      delimiterFound = true;
    } else {
      if (pPos < input.length() and delimiterFound) {
        if (skipEmpty==false or input.length()-pPos > 0) // if empty values are to be kept or not
          result.push_back(input.substr(pPos,input.length()-pPos));
      }
      break;
    }
    pPos = pos+1;
  }
  return result;
}

Upvotes: -1

Romário

Reputation: 1966

Yes, I looked through all 30 examples.

I couldn't find a version of split that works for multi-char delimiters, so here's mine:

#include <string>
#include <vector>

using namespace std;

vector<string> split(const string &str, const string &delim)
{   
    const auto delim_pos = str.find(delim);

    if (delim_pos == string::npos)
        return {str};

    vector<string> ret{str.substr(0, delim_pos)};
    auto tail = split(str.substr(delim_pos + delim.size(), string::npos), delim);

    ret.insert(ret.end(), tail.begin(), tail.end());

    return ret;
}

Probably not the most efficient of implementations, but it's a very straightforward recursive solution, using only <string> and <vector>.

Ah, it's written in C++11, but there's nothing special about this code, so you could easily adapt it to C++98.

Upvotes: 2

fishshrimp鱼虾爆栈

Reputation: 141

my general implementation for string and u32string ~, using the boost::algorithm::split signature.

template<typename CharT, typename UnaryPredicate>
void split(std::vector<std::basic_string<CharT>>& split_result,
           const std::basic_string<CharT>& s,
           UnaryPredicate predicate)
{
    using ST = std::basic_string<CharT>;
    using std::swap;
    std::vector<ST> tmp_result;
    auto iter = s.cbegin(),
         end_iter = s.cend();
    while (true)
    {
        /**
         * edge case: empty str -> push an empty str and exit.
         */
        auto find_iter = find_if(iter, end_iter, predicate);
        tmp_result.emplace_back(iter, find_iter);
        if (find_iter == end_iter) { break; }
        iter = ++find_iter; 
    }
    swap(tmp_result, split_result);
}


template<typename CharT>
void split(std::vector<std::basic_string<CharT>>& split_result,
           const std::basic_string<CharT>& s,
           const std::basic_string<CharT>& char_candidate)
{
    std::unordered_set<CharT> candidate_set(char_candidate.cbegin(),
                                            char_candidate.cend());
    auto predicate = [&candidate_set](const CharT& c) {
        return candidate_set.count(c) > 0U;
    };
    return split(split_result, s, predicate);
}

template<typename CharT>
void split(std::vector<std::basic_string<CharT>>& split_result,
           const std::basic_string<CharT>& s,
           const CharT* literals)
{
    return split(split_result, s, std::basic_string<CharT>(literals));
}

Upvotes: 0

Marco M.

Reputation: 3116

Here is a split function that:

is generic
uses standard C++ (no boost)
accepts multiple delimiters

ignores empty tokens (can easily be changed)

template<typename T>
vector<T> 
split(const T & str, const T & delimiters) {
    vector<T> v;
    typename T::size_type start = 0;
    auto pos = str.find_first_of(delimiters, start);
    while(pos != T::npos) {
        if(pos != start) // ignore empty tokens
            v.emplace_back(str, start, pos - start);
        start = pos + 1;
        pos = str.find_first_of(delimiters, start);
    }
    if(start < str.length()) // ignore trailing delimiter
        v.emplace_back(str, start, str.length() - start); // add what's left of the string
    return v;
}

Example usage:

    vector<string> v = split<string>("Hello, there; World", ";,");
    vector<wstring> v = split<wstring>(L"Hello, there; World", L";,");

Upvotes: 45

How do I iterate over the words of a string?

Answers (30)

Manual Implementation

Ranges V3

StringZilla

Use a Spliterator (Split Iterator)

Related Questions