Reputation: 78538
How do I iterate over the words of a string composed of words separated by whitespace?
Note that I'm not interested in C string functions or that kind of character manipulation/access. I prefer elegance over efficiency. My current solution:
#include <iostream>
#include <sstream>
#include <string>
using namespace std;
int main() {
string s = "Somewhere down the road";
istringstream iss(s);
do {
string subs;
iss >> subs;
cout << "Substring: " << subs << endl;
} while (iss);
}
Upvotes: 3377
Views: 2411571
Reputation: 494
First, there are many whitespace characters. Within ASCII those are: Space (0x20), Tab (0x09), Carriage return (0x0D), Line feed (0x0A), Vertical tab (0x0B), Form feed (0x0C). So if you are tokenizing text, most of the presented solutions won't work. Even if you care only about 0x20, they will be notoriously slow, generally around 500 MB/s on most hardware, compared to 10 GB/s typical single-core memory throughput.
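To make the whitespace set concrete, here is a minimal sketch that treats exactly the six ASCII characters listed above as separators and counts the words between them. The names is_ascii_space and count_words are made up for illustration and do not come from any library.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: true for the six ASCII whitespace characters listed above.
constexpr bool is_ascii_space(char c) noexcept {
    return c == ' ' || c == '\t' || c == '\r' || c == '\n' || c == '\v' || c == '\f';
}

// Counts maximal runs of non-whitespace characters.
constexpr std::size_t count_words(std::string_view text) noexcept {
    std::size_t words = 0;
    bool in_word = false;
    for (char c : text) {
        const bool space = is_ascii_space(c);
        if (!space && !in_word) ++words;  // a new word starts here
        in_word = !space;
    }
    return words;
}
```

A splitter that only compares against ' ' would treat "a\tb" as a single word, which is the failure mode described above.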
I have 3 alternatives, depending on your taste:
1. std::string_view::find_first_of.
2. ranges-v3, because std::ranges won't generalize.
3. The stringzilla SIMD-accelerated library.
#include <string_view>
template <typename Callback>
void split(std::string_view text, std::string_view delimiters, Callback callback) noexcept {
std::size_t start = 0, end;
while ((end = text.find_first_of(delimiters, start)) != std::string_view::npos) {
if (start != end) callback(text.substr(start, end - start));
start = end + 1;
}
if (start < text.size()) callback(text.substr(start));
}
Assuming .find_first_of has a nested loop, you may prefer to replace it with a loop-unrolled variant for performance. The same applies to the next solution.
The C++20 std::ranges::split can't handle custom predicates. Instead, we can use Eric's original ranges::views::split_when with a lambda, merging consecutive slices similar to other answers.
#include <range/v3/view/split_when.hpp>
template <typename Callback>
void split(std::string_view text, std::string_view delimiters, Callback callback) noexcept {
auto is_delimiter = [&](char c) { return delimiters.find(c) != std::string_view::npos; };
for (auto range : text | ranges::views::split_when(is_delimiter)) {
if (!range.empty()) {
auto token = std::string_view(&*range.begin(), ranges::distance(range));
callback(token);
}
}
}
If you often deal with strings, need higher compatibility than STL, and prefer your code to be fast, here is my alternative:
#include <stringzilla/stringzilla.hpp>
template <typename Callback>
void split(std::string_view text, std::string_view delimiters, Callback callback) noexcept {
namespace sz = ashvardanian::stringzilla;
for (sz::string_view token : sz::string_view(text).split(sz::char_set(delimiters))) {
if (!token.empty()) callback(token);
}
}
It should be several times faster at runtime than any other C/C++ library that does not use AVX-512 on x86 or Neon on Arm.
Upvotes: 2
Reputation: 661
Using std::string_view and Eric Niebler's range-v3 library:
https://wandbox.org/permlink/kW5lwRCL1pxjp2pW
#include <iostream>
#include <string>
#include <string_view>
#include "range/v3/view.hpp"
#include "range/v3/algorithm.hpp"
int main() {
std::string s = "Somewhere down the range v3 library";
ranges::for_each(s
| ranges::view::split(' ')
| ranges::view::transform([](auto &&sub) {
return std::string_view(&*sub.begin(), ranges::distance(sub));
}),
[](auto s) {std::cout << "Substring: " << s << "\n";}
);
}
Using a range-based for loop instead of the ranges::for_each algorithm:
#include <iostream>
#include <string>
#include <string_view>
#include "range/v3/view.hpp"
int main()
{
std::string str = "Somewhere down the range v3 library";
for (auto s : str | ranges::view::split(' ')
| ranges::view::transform([](auto&& sub) { return std::string_view(&*sub.begin(), ranges::distance(sub)); }
))
{
std::cout << "Substring: " << s << "\n";
}
}
As sehe pointed out in the comments, the preferred std::ranges solution should be:
#include <iomanip>
#include <iostream>
#include <ranges>
#include <string_view>
using namespace std::literals;
int main() {
static constexpr auto str = "Somewhere down the c++20 standard library"sv;
for (auto [b, e] : str | std::ranges::views::split(' '))
std::cout << "Substring: " << quoted(std::string_view(b, e)) << "\n";
}
Compiling with -std=c++20 should do the trick.
Upvotes: 23
Reputation: 19700
Here's a simple solution that uses only the standard regex library:
#include <regex>
#include <string>
#include <vector>
std::vector<std::string> Tokenize( const std::string str, const std::regex regex )
{
using namespace std;
std::vector<string> result;
sregex_token_iterator it( str.begin(), str.end(), regex, -1 );
sregex_token_iterator reg_end;
for ( ; it != reg_end; ++it ) {
if ( !it->str().empty() ) //token could be empty:check
result.emplace_back( it->str() );
}
return result;
}
The regex argument allows checking for multiple delimiters (spaces, commas, etc.).
I usually only check to split on spaces and commas, so I also have this default function:
std::vector<std::string> TokenizeDefault( const std::string str )
{
using namespace std;
regex re( "[\\s,]+" );
return Tokenize( str, re );
}
The "[\\s,]+" checks for whitespace (\\s) and commas (,).
Note, if you want to split wstring instead of string, change:
- std::regex to std::wregex
- sregex_token_iterator to wsregex_token_iterator
Note, you might also want to take the string argument by reference, depending on your compiler.
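The two notes above (the wstring substitutions and passing by reference) can be sketched together as follows. The function name TokenizeW is hypothetical. The types std::wregex and std::wsregex_token_iterator are the standard wide-character counterparts.

```cpp
#include <regex>
#include <string>
#include <vector>

// Wide-character counterpart of Tokenize; arguments are taken by const reference.
std::vector<std::wstring> TokenizeW(const std::wstring& str, const std::wregex& regex) {
    std::vector<std::wstring> result;
    std::wsregex_token_iterator it(str.begin(), str.end(), regex, -1);  // -1: yield the text between matches
    std::wsregex_token_iterator end;
    for (; it != end; ++it) {
        if (it->length() != 0)  // a leading delimiter produces an empty first token
            result.emplace_back(it->str());
    }
    return result;
}
```

A call would look like TokenizeW(text, std::wregex(L"[\\s,]+")), mirroring TokenizeDefault.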
Upvotes: 37
Reputation: 9970
It is possible to iterate over the words of the input string without any heap allocations and without building intermediate data structures like a vector of substrings. This Spliterator class returns a string_view for each substring, avoiding allocations.
struct Spliterator
{
Spliterator(string_view sentence) : sentence_(sentence), word_end_(sentence.begin())
{
next();
}
operator string_view() const { return {word_begin_, word_end_};}
void next()
{
word_begin_ = word_end_;
while(word_begin_ != sentence_.end() && std::isspace(*word_begin_)) ++word_begin_;
word_end_ = word_begin_;
while(word_end_ != sentence_.end() && (!std::isspace(*word_end_))) ++word_end_;
}
string_view sentence_;
string_view::iterator word_begin_;
string_view::iterator word_end_;
};
Usage:
void process_word(string_view word); // Do whatever you want with the words.
void process_words(string_view sentence)
{
Spliterator spliterator {sentence};
string_view word;
while((word = spliterator).length() > 0)
{
process_word(word);
spliterator.next();
}
}
This idea can be generalised to take a user-specified set of splitting characters.
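One possible shape for that generalisation is sketched below. It is not the Spliterator class itself but a variant built on find_first_of and find_first_not_of. The class name DelimSplitter and its interface are assumptions made for illustration.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical generalisation: `delims` is the caller-supplied set of separator characters.
class DelimSplitter {
public:
    DelimSplitter(std::string_view sentence, std::string_view delims)
        : rest_(sentence), delims_(delims) {
        next();
    }

    // Current word; empty once the input is exhausted.
    std::string_view word() const { return word_; }

    // Advances to the next word, skipping any run of delimiters.
    void next() {
        const std::size_t begin = rest_.find_first_not_of(delims_);
        if (begin == std::string_view::npos) {  // only delimiters (or nothing) left
            word_ = {};
            rest_ = {};
            return;
        }
        std::size_t end = rest_.find_first_of(delims_, begin);
        if (end == std::string_view::npos) end = rest_.size();
        word_ = rest_.substr(begin, end - begin);
        rest_.remove_prefix(end);
    }

private:
    std::string_view rest_;    // unconsumed input
    std::string_view delims_;  // separator characters
    std::string_view word_;    // current token
};
```

As with the original, no allocation takes place, and the returned views are only valid while the underlying string is alive.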
Upvotes: 0
Reputation: 3511
An efficient, small, and elegant solution using a template function:
template <class ContainerT>
void split(const std::string& str, ContainerT& tokens,
const std::string& delimiters = " ", bool trimEmpty = false)
{
std::string::size_type pos, lastPos = 0, length = str.length();
using value_type = typename ContainerT::value_type;
using size_type = typename ContainerT::size_type;
while (lastPos < length + 1)
{
pos = str.find_first_of(delimiters, lastPos);
if (pos == std::string::npos)
pos = length;
if (pos != lastPos || !trimEmpty)
tokens.emplace_back(value_type(str.data() + lastPos,
(size_type)pos - lastPos));
lastPos = pos + 1;
}
}
I usually choose to use std::vector<std::string> types as my second parameter (ContainerT)... but list<...> may sometimes be preferred over vector<...>.
It also allows you to specify whether to trim empty tokens from the results via a last optional parameter.
All it requires is std::string included via <string>. It does not use streams or the boost library explicitly but will be able to accept some of these types.
Also since C++17 you can use std::vector<std::string_view> which is much faster and more memory-efficient than using std::string. Here is a revised version which also supports the container as a return type:
#include <vector>
#include <string_view>
#include <utility>
template < typename StringT,
typename DelimiterT = char,
typename ContainerT = std::vector<std::string_view> >
ContainerT split(StringT const& str, DelimiterT const& delimiters = ' ', bool trimEmpty = true, ContainerT&& tokens = {})
{
typename StringT::size_type pos, lastPos = 0, length = str.length();
while (lastPos < length + 1)
{
pos = str.find_first_of(delimiters, lastPos);
if (pos == StringT::npos)
pos = length;
if (pos != lastPos || !trimEmpty)
tokens.emplace_back(str.data() + lastPos, pos - lastPos);
lastPos = pos + 1;
}
return std::forward<ContainerT>(tokens);
}
Care has been taken not to make any unneeded copies.
This will allow for either:
for (auto const& line : split(str, '\n'))
Or:
auto lines = split(str, '\n');
Both returning the default template container type of std::vector<std::string_view>.
To get a specific container type back, or to pass an existing container, use the tokens input parameter with either a typed initial container or an existing container variable:
auto lines = split(str, '\n', false, std::vector<std::string>());
Or:
std::vector<std::string> lines;
split(str, '\n', false, lines);
Upvotes: 202
Reputation: 577
C++20 finally blesses us with a split function. Or rather, a range adapter. Godbolt link.
#include <iostream>
#include <ranges>
#include <string_view>
namespace ranges = std::ranges;
namespace views = std::views;
using str = std::string_view;
auto view =
"Multiple words"
| views::split(' ')
| views::transform([](auto &&r) -> str {
return str(r.begin(), r.end());
});
auto main() -> int {
for (str &&sv : view) {
std::cout << sv << '\n';
}
}
Upvotes: 29
Reputation: 4427
The following is a much better way to do this. It can take any character and doesn't split lines unless you want. No special libraries are needed (well, besides std, but who really considers that an extra library). No pointers or references are needed, and it's static. Just simple plain C++.
#pragma once
#include <vector>
#include <sstream>
using namespace std;
class Helpers
{
public:
static vector<string> split(string s, char delim)
{
stringstream temp (stringstream::in | stringstream::out);
vector<string> elems(0);
if (s.size() == 0 || delim == 0)
return elems;
for(char c : s)
{
if(c == delim)
{
elems.push_back(temp.str());
temp = stringstream(stringstream::in | stringstream::out);
}
else
temp << c;
}
if (temp.str().size() > 0)
elems.push_back(temp.str());
return elems;
}
//Splits string s with a list of delimiters in delims (it's just a list, like if we wanted to
//split at the following letters, a, b, c we would make delims="abc".
static vector<string> split(string s, string delims)
{
stringstream temp (stringstream::in | stringstream::out);
vector<string> elems(0);
bool found;
if(s.size() == 0 || delims.size() == 0)
return elems;
for(char c : s)
{
found = false;
for(char d : delims)
{
if (c == d)
{
elems.push_back(temp.str());
temp = stringstream(stringstream::in | stringstream::out);
found = true;
break;
}
}
if(!found)
temp << c;
}
if(temp.str().size() > 0)
elems.push_back(temp.str());
return elems;
}
};
Upvotes: 6
Reputation: 605
I made this because I needed an easy way to split strings and C-based strings. Hopefully someone else can find it useful as well. Also, it doesn't rely on tokens, and you can use fields as delimiters, which is another key I needed.
I'm sure there are improvements that can be made to even further improve its elegance, and please do by all means.
StringSplitter.hpp:
#include <vector>
#include <iostream>
#include <string.h>
using namespace std;
class StringSplit
{
private:
void copy_fragment(char*, char*, char*);
void copy_fragment(char*, char*, char);
bool match_fragment(char*, char*, int);
int untilnextdelim(char*, char);
int untilnextdelim(char*, char*);
void assimilate(char*, char);
void assimilate(char*, char*);
bool string_contains(char*, char*);
long calc_string_size(char*);
void copy_string(char*, char*);
public:
vector<char*> split_cstr(char);
vector<char*> split_cstr(char*);
vector<string> split_string(char);
vector<string> split_string(char*);
char* String;
bool do_string;
bool keep_empty;
vector<char*> Container;
vector<string> ContainerS;
StringSplit(char * in)
{
String = in;
}
StringSplit(string in)
{
size_t len = calc_string_size((char*)in.c_str());
String = new char[len + 1];
memset(String, 0, len + 1);
copy_string(String, (char*)in.c_str());
do_string = true;
}
~StringSplit()
{
for (int i = 0; i < Container.size(); i++)
{
if (Container[i] != NULL)
{
delete[] Container[i];
}
}
if (do_string)
{
delete[] String;
}
}
};
StringSplitter.cpp:
#include <string.h>
#include <iostream>
#include <vector>
#include "StringSplitter.hpp"
using namespace std;
void StringSplit::assimilate(char*src, char delim)
{
int until = untilnextdelim(src, delim);
if (until > 0)
{
char * temp = new char[until + 1];
memset(temp, 0, until + 1);
copy_fragment(temp, src, delim);
if (keep_empty || *temp != 0)
{
if (!do_string)
{
Container.push_back(temp);
}
else
{
string x = temp;
ContainerS.push_back(x);
}
}
else
{
delete[] temp;
}
}
}
void StringSplit::assimilate(char*src, char* delim)
{
int until = untilnextdelim(src, delim);
if (until > 0)
{
char * temp = new char[until + 1];
memset(temp, 0, until + 1);
copy_fragment(temp, src, delim);
if (keep_empty || *temp != 0)
{
if (!do_string)
{
Container.push_back(temp);
}
else
{
string x = temp;
ContainerS.push_back(x);
}
}
else
{
delete[] temp;
}
}
}
long StringSplit::calc_string_size(char* _in)
{
long i = 0;
while (*_in++)
{
i++;
}
return i;
}
bool StringSplit::string_contains(char* haystack, char* needle)
{
size_t len = calc_string_size(needle);
size_t lenh = calc_string_size(haystack);
while (lenh--)
{
if (match_fragment(haystack + lenh, needle, len))
{
return true;
}
}
return false;
}
bool StringSplit::match_fragment(char* _src, char* cmp, int len)
{
while (len--)
{
if (*(_src + len) != *(cmp + len))
{
return false;
}
}
return true;
}
int StringSplit::untilnextdelim(char* _in, char delim)
{
size_t len = calc_string_size(_in);
if (*_in == delim)
{
_in += 1;
return len - 1;
}
int c = 0;
while (*(_in + c) != delim && c < len)
{
c++;
}
return c;
}
int StringSplit::untilnextdelim(char* _in, char* delim)
{
int s = calc_string_size(delim);
int c = 1 + s;
if (!string_contains(_in, delim))
{
return calc_string_size(_in);
}
else if (match_fragment(_in, delim, s))
{
_in += s;
return calc_string_size(_in);
}
while (!match_fragment(_in + c, delim, s))
{
c++;
}
return c;
}
void StringSplit::copy_fragment(char* dest, char* src, char delim)
{
if (*src == delim)
{
src++;
}
int c = 0;
while (*(src + c) != delim && *(src + c))
{
*(dest + c) = *(src + c);
c++;
}
*(dest + c) = 0;
}
void StringSplit::copy_string(char* dest, char* src)
{
int i = 0;
while (*(src + i))
{
*(dest + i) = *(src + i);
i++;
}
}
void StringSplit::copy_fragment(char* dest, char* src, char* delim)
{
size_t len = calc_string_size(delim);
size_t lens = calc_string_size(src);
if (match_fragment(src, delim, len))
{
src += len;
lens -= len;
}
int c = 0;
while (!match_fragment(src + c, delim, len) && (c < lens))
{
*(dest + c) = *(src + c);
c++;
}
*(dest + c) = 0;
}
vector<char*> StringSplit::split_cstr(char Delimiter)
{
int i = 0;
while (*String)
{
if (*String != Delimiter && i == 0)
{
assimilate(String, Delimiter);
}
if (*String == Delimiter)
{
assimilate(String, Delimiter);
}
i++;
String++;
}
String -= i;
delete[] String;
return Container;
}
vector<string> StringSplit::split_string(char Delimiter)
{
do_string = true;
int i = 0;
while (*String)
{
if (*String != Delimiter && i == 0)
{
assimilate(String, Delimiter);
}
if (*String == Delimiter)
{
assimilate(String, Delimiter);
}
i++;
String++;
}
String -= i;
delete[] String;
return ContainerS;
}
vector<char*> StringSplit::split_cstr(char* Delimiter)
{
int i = 0;
size_t LenDelim = calc_string_size(Delimiter);
while(*String)
{
if (!match_fragment(String, Delimiter, LenDelim) && i == 0)
{
assimilate(String, Delimiter);
}
if (match_fragment(String, Delimiter, LenDelim))
{
assimilate(String,Delimiter);
}
i++;
String++;
}
String -= i;
delete[] String;
return Container;
}
vector<string> StringSplit::split_string(char* Delimiter)
{
do_string = true;
int i = 0;
size_t LenDelim = calc_string_size(Delimiter);
while (*String)
{
if (!match_fragment(String, Delimiter, LenDelim) && i == 0)
{
assimilate(String, Delimiter);
}
if (match_fragment(String, Delimiter, LenDelim))
{
assimilate(String, Delimiter);
}
i++;
String++;
}
String -= i;
delete[] String;
return ContainerS;
}
Examples:
int main(int argc, char*argv[])
{
StringSplit ss = "This:CUT:is:CUT:an:CUT:example:CUT:cstring";
vector<char*> Split = ss.split_cstr(":CUT:");
for (int i = 0; i < Split.size(); i++)
{
cout << Split[i] << endl;
}
return 0;
}
Will output:
This
is
an
example
cstring
int main(int argc, char*argv[])
{
StringSplit ss = "This:is:an:example:cstring";
vector<char*> Split = ss.split_cstr(':');
for (int i = 0; i < Split.size(); i++)
{
cout << Split[i] << endl;
}
return 0;
}
int main(int argc, char*argv[])
{
string mystring = "This[SPLIT]is[SPLIT]an[SPLIT]example[SPLIT]string";
StringSplit ss = mystring;
vector<string> Split = ss.split_string("[SPLIT]");
for (int i = 0; i < Split.size(); i++)
{
cout << Split[i] << endl;
}
return 0;
}
int main(int argc, char*argv[])
{
string mystring = "This|is|an|example|string";
StringSplit ss = mystring;
vector<string> Split = ss.split_string('|');
for (int i = 0; i < Split.size(); i++)
{
cout << Split[i] << endl;
}
return 0;
}
To keep empty entries (by default empties will be excluded):
StringSplit ss = mystring;
ss.keep_empty = true;
vector<string> Split = ss.split_string(":DELIM:");
The goal was to make it similar to C#'s Split() method where splitting a string is as easy as:
String[] Split =
"Hey:cut:what's:cut:your:cut:name?".Split(new[]{":cut:"}, StringSplitOptions.None);
foreach(String X in Split)
{
Console.Write(X);
}
I hope someone else can find this as useful as I do.
Upvotes: 13
Reputation: 12801
Some C++20 compilers and most of the C++23 compilers (ranges and string_view):
for (auto word : std::views::split("Somewhere down the road", ' '))
std::cout << std::string_view{ word.begin(), word.end() } << std::endl;
Upvotes: 2
Reputation: 90513
I use this to split a string by a delimiter. The first function puts the results in a pre-constructed vector, the second returns a new vector.
#include <string>
#include <sstream>
#include <vector>
#include <iterator>
template <typename Out>
void split(const std::string &s, char delim, Out result) {
std::istringstream iss(s);
std::string item;
while (std::getline(iss, item, delim)) {
*result++ = item;
}
}
std::vector<std::string> split(const std::string &s, char delim) {
std::vector<std::string> elems;
split(s, delim, std::back_inserter(elems));
return elems;
}
Note that this solution does not skip empty tokens, so the following will find 4 items, one of which is empty:
std::vector<std::string> x = split("one:two::three", ':');
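If empty tokens are not wanted, one small change is to test each item before storing it. The following is a sketch of that variant. The name split_nonempty is hypothetical and not part of the answer above.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical variant of split() that drops empty tokens.
std::vector<std::string> split_nonempty(const std::string& s, char delim) {
    std::vector<std::string> elems;
    std::istringstream iss(s);
    std::string item;
    while (std::getline(iss, item, delim)) {
        if (!item.empty()) elems.push_back(item);  // skip the gap between adjacent delimiters
    }
    return elems;
}
```

With this variant, "one:two::three" yields three items instead of four.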
Upvotes: 2590
Reputation: 275820
Yet another way -- continuation passing style, zero allocation, function based delimiting.
void split( auto&& data, auto&& splitter, auto&& operation ) {
using std::begin; using std::end;
auto prev = begin(data);
while (prev != end(data) ) {
// Fresh names here: redeclaring prev would shadow the loop variable and never advance it.
auto [first, last] = splitter( prev, end(data) );
operation(first, last);
prev = last;
}
}
Now we can write specific split functions based off this.
auto anyOfSplitter(auto delimiters) {
return [delimiters](auto begin, auto end) {
while( begin != end && 0 == std::string_view(begin, end).find_first_of(delimiters) ) {
++begin;
}
auto view = std::string_view(begin, end);
auto next = view.find_first_of(delimiters);
if (next != view.npos)
return std::make_pair( begin, begin + next );
else
return std::make_pair( begin, end );
};
}
We can now produce a traditional std string split like this:
template<class C>
auto traditional_any_of_split( std::basic_string_view<C> str, std::basic_string_view<C> delim ) {
std::vector<std::basic_string<C>> retval;
split( str, anyOfSplitter(delim), [&](auto s, auto f) {
retval.emplace_back(s,f);
});
return retval;
}
or we can use find instead
auto findSplitter(auto delimiter) {
return [delimiter](auto begin, auto end) {
while( begin != end && 0 == std::string_view(begin, end).find(delimiter) ) {
begin += delimiter.size();
}
auto view = std::string_view(begin, end);
auto next = view.find(delimiter);
if (next != view.npos)
return std::make_pair( begin, begin + next );
else
return std::make_pair( begin, end );
};
}
template<class C>
auto traditional_find_split( std::basic_string_view<C> str, std::basic_string_view<C> delim ) {
std::vector<std::basic_string<C>> retval;
split( str, findSplitter(delim), [&](auto s, auto f) {
retval.emplace_back(s,f);
});
return retval;
}
by replacing the splitter portion.
Both of these allocate a buffer of return values. We can swap the return values to string views at the cost of manually managing lifetime.
We can also take a continuation that will get passed the string views one at a time, avoiding even allocating the vector of views.
This can be extended with an abort option, so that we can abort after reading a few prefix strings.
Upvotes: 1
Reputation: 27618
I cannot believe how overly complicated most of these answers were. Why didn't someone suggest something as simple as this?
#include <iostream>
#include <sstream>
#include <string>
std::string input = "This is a sentence to read";
std::istringstream ss(input);
std::string token;
while(std::getline(ss, token, ' ')) {
std::cout << token << std::endl;
}
Upvotes: 11
Reputation: 1314
A minimal solution is a function which takes as input a std::string and a set of delimiter characters (as a std::string), and returns a std::vector of std::string.
#include <string>
#include <vector>
std::vector<std::string>
tokenize(const std::string& str, const std::string& delimiters)
{
using ssize_t = std::string::size_type;
const ssize_t str_ln = str.length();
ssize_t last_pos = 0;
// container for the extracted tokens
std::vector<std::string> tokens;
while (last_pos < str_ln) {
// find the position of the next delimiter
ssize_t pos = str.find_first_of(delimiters, last_pos);
// if no delimiters found, set the position to the length of string
if (pos == std::string::npos)
pos = str_ln;
// if the substring is nonempty, store it in the container
if (pos != last_pos)
tokens.emplace_back(str.substr(last_pos, pos - last_pos));
// scan past the previous substring
last_pos = pos + 1;
}
return tokens;
}
A usage example:
#include <iostream>
int main()
{
std::string input_str = "one + two * (three - four)!!---! ";
const char* delimiters = "! +- (*)";
std::vector<std::string> tokens = tokenize(input_str, delimiters);
std::cout << "input = '" << input_str << "'\n"
<< "delimiters = '" << delimiters << "'\n"
<< "nr of tokens found = " << tokens.size() << std::endl;
for (const std::string& tk : tokens) {
std::cout << "token = '" << tk << "'\n";
}
return 0;
}
Upvotes: 1
Reputation: 100718
This is similar to the Stack Overflow question "How do I tokenize a string in C++?". It requires the external Boost library:
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>
using namespace std;
using namespace boost;
int main(int argc, char** argv)
{
string text = "token test\tstring";
char_separator<char> sep(" \t");
tokenizer<char_separator<char>> tokens(text, sep);
for (const string& t : tokens)
{
cout << t << "." << endl;
}
}
Upvotes: 89
Reputation:
There's a way easier method to do this!!
#include <iostream>
#include <string>
#include <vector>
std::vector<std::string> splitby(std::string string, char splitter) {
std::vector<std::string> result = {};
std::string locresult = "";
for (unsigned int i = 0; i < string.size(); i++) {
if ((char)string.at(i) != splitter) {
locresult += string.at(i);
}
else {
result.push_back(locresult);
locresult = "";
}
}
// The final piece (possibly empty) is always appended.
result.push_back(locresult);
return result;
}
void printvector(std::vector<std::string> v) {
std::cout << '{';
for (unsigned int i = 0; i < v.size(); i++) {
if (i < v.size() - 1) {
std::cout << '"' << v.at(i) << "\",";
}
else {
std::cout << '"' << v.at(i) << "\"";
}
}
std::cout << "}\n";
}
Upvotes: 0
Reputation: 1163
Although an earlier answer already provides a C++20 solution, several changes have since been applied to C++20 as Defect Reports. As a result, the solution is now a little shorter and nicer:
#include <iostream>
#include <ranges>
#include <string_view>
namespace views = std::views;
using str = std::string_view;
constexpr str text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit.";
auto splitByWords(str input) {
return input
| views::split(' ')
| views::transform([](auto &&r) -> str {
return {r.begin(), r.end()};
});
}
auto main() -> int {
for (str &&word : splitByWords(text)) {
std::cout << word << '\n';
}
}
As of today it is still available only on the trunk branch of GCC (Godbolt link). It is based on two changes: P1391 iterator constructor for std::string_view and P2210 DR fixing std::views::split to preserve range type.
In C++23 no transform boilerplate is needed, since P1989 adds a range constructor to std::string_view. That constructor was later made explicit (P2499), so the conversion has to be spelled out:
#include <iostream>
#include <ranges>
#include <string_view>
namespace views = std::views;
constexpr std::string_view text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit.";
auto main() -> int {
for (auto&& word : text | views::split(' ')) {
std::cout << std::string_view(word) << '\n';
}
}
Upvotes: 9
Reputation: 72
Everyone answered for a predefined string input. I think this answer will help someone working with input read at runtime.
I used a tokens vector to hold the string tokens. It's optional.
#include <bits/stdc++.h>
using namespace std ;
int main()
{
string str, token ;
getline(cin, str) ; // get the string as input
istringstream ss(str); // insert the string into tokenizer
vector<string> tokens; // vector tokens holds the tokens
while (ss >> token) tokens.push_back(token); // splits the tokens
for(auto x : tokens) cout << x << endl ; // prints the tokens
return 0;
}
sample input:
port city international university
sample output:
port
city
international
university
Note that by default this splits on whitespace only. To use a custom delimiter, the code has to be changed slightly. For example, with ',' as the delimiter, use
char delimiter = ',' ;
while(getline(ss, token, delimiter)) tokens.push_back(token) ;
instead of
while (ss >> token) tokens.push_back(token);
Upvotes: 6
Reputation: 27684
C++17 version without any memory allocation (except maybe for std::function):
void iter_words(const std::string_view& input, const std::function<void(std::string_view)>& process_word) {
auto itr = input.begin();
auto consume_whitespace = [&]() {
for(; itr != input.end(); ++itr) {
if(!isspace(*itr))
return;
}
};
auto consume_letters = [&]() {
for(; itr != input.end(); ++itr) {
if(isspace(*itr))
return;
}
};
while(true) {
consume_whitespace();
if(itr == input.end())
return;
auto word_start = itr - input.begin();
consume_letters();
auto word_end = itr - input.begin();
process_word(input.substr(word_start, word_end - word_start));
}
}
int main() {
iter_words("foo bar", [](std::string_view sv) {
std::cout << "Got word: " << sv << '\n';
});
return 0;
}
Upvotes: 2
Reputation: 2034
For what it's worth, here's another way to extract tokens from an input string, relying only on standard library facilities. It's an example of the power and elegance behind the design of the STL.
#include <iostream>
#include <string>
#include <sstream>
#include <algorithm>
#include <iterator>
int main() {
using namespace std;
string sentence = "And I feel fine...";
istringstream iss(sentence);
copy(istream_iterator<string>(iss),
istream_iterator<string>(),
ostream_iterator<string>(cout, "\n"));
}
Instead of copying the extracted tokens to an output stream, one could insert them into a container, using the same generic copy algorithm.
vector<string> tokens;
copy(istream_iterator<string>(iss),
istream_iterator<string>(),
back_inserter(tokens));
... or create the vector directly:
vector<string> tokens{istream_iterator<string>{iss},
istream_iterator<string>{}};
Upvotes: 1512
Reputation: 1911
I have a very different approach from the other solutions that offers a lot of value in ways that the other solutions are variously lacking, but of course also has its own downsides. Here is the working implementation, with the example of putting <tag></tag> around words.
For a start, this problem can be solved with one loop, no additional memory, and by considering merely four logical cases. Conceptually, we're interested in boundaries. Our code should reflect that: let's iterate through the string and look at two characters at a time, bearing in mind that we have special cases at the start and end of the string.
The downside is that we have to write the implementation, which is somewhat verbose, but mostly convenient boilerplate.
The upside is that we wrote the implementation, so it is very easy to customize it to specific needs, such as distinguishing left and right word boundaries, using any set of delimiters, or handling other cases such as non-boundary or erroneous positions.
#include <iostream>
#include <string>
#include <cctype>
using namespace std;
typedef enum boundary_type_e {
E_BOUNDARY_TYPE_ERROR = -1,
E_BOUNDARY_TYPE_NONE,
E_BOUNDARY_TYPE_LEFT,
E_BOUNDARY_TYPE_RIGHT,
} boundary_type_t;
typedef struct boundary_s {
boundary_type_t type;
int pos;
} boundary_t;
bool is_delim_char(int c) {
return isspace(c); // also compare against any other chars you want to use as delimiters
}
bool is_word_char(int c) {
return ' ' <= c && c <= '~' && !is_delim_char(c);
}
boundary_t maybe_word_boundary(string str, int pos) {
int len = str.length();
if (pos < 0 || pos >= len) {
return (boundary_t){.type = E_BOUNDARY_TYPE_ERROR};
} else {
if (pos == 0 && is_word_char(str[pos])) {
// if the first character is word-y, we have a left boundary at the beginning
return (boundary_t){.type = E_BOUNDARY_TYPE_LEFT, .pos = pos};
} else if (pos == len - 1 && is_word_char(str[pos])) {
// if the last character is word-y, we have a right boundary left of the null terminator
return (boundary_t){.type = E_BOUNDARY_TYPE_RIGHT, .pos = pos + 1};
} else if (!is_word_char(str[pos]) && is_word_char(str[pos + 1])) {
// if we have a delimiter followed by a word char, we have a left boundary left of the word char
return (boundary_t){.type = E_BOUNDARY_TYPE_LEFT, .pos = pos + 1};
} else if (is_word_char(str[pos]) && !is_word_char(str[pos + 1])) {
// if we have a word char followed by a delimiter, we have a right boundary right of the word char
return (boundary_t){.type = E_BOUNDARY_TYPE_RIGHT, .pos = pos + 1};
}
return (boundary_t){.type = E_BOUNDARY_TYPE_NONE};
}
}
int main() {
string str;
getline(cin, str);
int len = str.length();
for (int i = 0; i < len; i++) {
boundary_t boundary = maybe_word_boundary(str, i);
if (boundary.type == E_BOUNDARY_TYPE_LEFT) {
// whatever
} else if (boundary.type == E_BOUNDARY_TYPE_RIGHT) {
// whatever
}
}
}
As you can see, the code is very simple to understand and fine tune, and the actual usage of the code is very short and simple. Using C++ should not stop us from writing the simplest and most readily customized code possible, even if that means not using the STL. I would think this is an instance of what Linus Torvalds might call "taste", since we have eliminated all the logic we don't need while writing in a style that naturally allows more cases to be handled when and if the need to handle them arises.
What could improve this code might be the use of enum class, accepting a function pointer to is_word_char in maybe_word_boundary instead of invoking is_word_char directly, and passing a lambda.
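Those improvements might look roughly like the sketch below. It is a simplified reformulation rather than a drop-in replacement: it classifies the gap before index pos (from 0 to the string length inclusive) and treats both ends of the string as non-word. The names BoundaryType and boundary_at are hypothetical.

```cpp
#include <cstddef>
#include <string_view>

enum class BoundaryType { None, Left, Right };

// Classifies the gap before index `pos`, where 0 <= pos <= text.size().
// `is_word` is any callable taking a char and returning bool (for example a lambda).
template <class IsWord>
BoundaryType boundary_at(std::string_view text, std::size_t pos, IsWord is_word) {
    const bool before = pos > 0 && is_word(text[pos - 1]);
    const bool after = pos < text.size() && is_word(text[pos]);
    if (!before && after) return BoundaryType::Left;   // a word starts at pos
    if (before && !after) return BoundaryType::Right;  // a word ended just before pos
    return BoundaryType::None;
}
```

Because the predicate is a parameter, the set of delimiters can be changed at the call site without touching the boundary logic.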
Upvotes: 1
Reputation: 161914
#include <vector>
#include <string>
#include <sstream>
int main()
{
std::string str("Split me by whitespaces");
std::string buf; // Have a buffer string
std::stringstream ss(str); // Insert the string into a stream
std::vector<std::string> tokens; // Create vector to hold our words
while (ss >> buf)
tokens.push_back(buf);
return 0;
}
Upvotes: 414
Reputation: 9003
Using std::stringstream as you have works perfectly fine and does exactly what you wanted. If you're just looking for a different way of doing things though, you can use std::find()/std::find_first_of() and std::string::substr().
Here's an example:
#include <iostream>
#include <string>
int main()
{
std::string s("Somewhere down the road");
std::string::size_type prev_pos = 0, pos = 0;
while( (pos = s.find(' ', pos)) != std::string::npos )
{
std::string substring( s.substr(prev_pos, pos-prev_pos) );
std::cout << substring << '\n';
prev_pos = ++pos;
}
std::string substring( s.substr(prev_pos, pos-prev_pos) ); // Last word
std::cout << substring << '\n';
return 0;
}
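Note that the loop above splits on a single space only, so consecutive spaces produce empty substrings and tabs are not treated as separators. If that matters, one possible variant uses the iterator algorithms std::find_if and std::find_if_not together with std::isspace. This is a minimal sketch, and split_whitespace is a hypothetical name:

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: splits on any run of whitespace using iterator algorithms.
std::vector<std::string> split_whitespace(const std::string& s) {
    auto is_space = [](unsigned char c) { return std::isspace(c) != 0; };
    std::vector<std::string> words;
    auto it = s.begin();
    while (true) {
        // Skip the run of whitespace before the next word.
        it = std::find_if_not(it, s.end(), is_space);
        if (it == s.end()) break;
        // The word ends at the next whitespace character (or at the end).
        auto word_end = std::find_if(it, s.end(), is_space);
        words.emplace_back(it, word_end);
        it = word_end;
    }
    return words;
}
```

Leading, trailing, and repeated whitespace are all skipped, so no empty strings end up in the result.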
Upvotes: 33
Reputation: 5565
This is my favorite way to iterate through a string. You can do whatever you want per word.
string line = "a line of text to iterate through";
string word;
istringstream iss(line, istringstream::in);
while( iss >> word )
{
// Do something on `word` here...
}
Upvotes: 142
Reputation:
The STL does not have such a method available already.
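However, you can either use C's strtok() function on a mutable copy of the string (strtok() modifies its input, so the const pointer returned by std::string::c_str() cannot be passed to it directly), or you can write your own. Here is a code sample I found after a quick Google search ("STL string split"):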
void Tokenize(const string& str,
vector<string>& tokens,
const string& delimiters = " ")
{
// Skip delimiters at beginning.
string::size_type lastPos = str.find_first_not_of(delimiters, 0);
// Find the first delimiter after the start of the token.
string::size_type pos = str.find_first_of(delimiters, lastPos);
while (string::npos != pos || string::npos != lastPos)
{
// Found a token, add it to the vector.
tokens.push_back(str.substr(lastPos, pos - lastPos));
// Skip delimiters. Note the "not_of"
lastPos = str.find_first_not_of(delimiters, pos);
// Find the next delimiter, which ends the next token.
pos = str.find_first_of(delimiters, lastPos);
}
}
Taken from: http://oopweb.com/CPP/Documents/CPPHOWTO/Volume/C++Programming-HOWTO-7.html
If you have questions about the code sample, leave a comment and I will explain.
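For the strtok() route, here is a minimal sketch. The helper name tokenize_strtok is hypothetical. Because strtok() writes terminators into the buffer it scans, the sketch copies the characters into a std::vector<char> first. Keep in mind that strtok() holds internal state between calls, so it is not safe to interleave two tokenizations or to call it from several threads.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper illustrating the strtok() route.
std::vector<std::string> tokenize_strtok(const std::string& str, const char* delimiters = " ") {
    // strtok writes '\0' into its input, so work on a mutable, null-terminated copy.
    std::vector<char> buffer(str.begin(), str.end());
    buffer.push_back('\0');

    std::vector<std::string> tokens;
    for (char* tok = std::strtok(buffer.data(), delimiters); tok != nullptr;
         tok = std::strtok(nullptr, delimiters)) {
        tokens.emplace_back(tok);  // copy the token out of the buffer
    }
    return tokens;
}
```

Like Tokenize above, this skips runs of consecutive delimiters rather than producing empty tokens.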
And just because it does not implement a typedef called iterator or overload the << operator does not mean it is bad code. I use C functions quite frequently. For example, printf and scanf both are faster than std::cin and std::cout (significantly), the fopen syntax is a lot more friendly for binary types, and they also tend to produce smaller EXEs.
Don't get sold on this "Elegance over performance" deal.
Upvotes: 60
Reputation: 438
This answer takes the string and puts it into a vector of strings. It uses the boost library.
#include <boost/algorithm/string.hpp>
std::vector<std::string> strs;
boost::split(strs, "string to split", boost::is_any_of("\t "));
Upvotes: 13
Reputation: 506
#include <iostream>
#include <string>
#include <deque>
std::deque<std::string> split(
const std::string& line,
std::string::value_type delimiter,
bool skipEmpty = false
) {
std::deque<std::string> parts{};
if (!skipEmpty && !line.empty() && delimiter == line.at(0)) {
parts.push_back({});
}
for (const std::string::value_type& c : line) {
if (
(
c == delimiter
&&
(skipEmpty ? (!parts.empty() && !parts.back().empty()) : true)
)
||
(c != delimiter && parts.empty())
) {
parts.push_back({});
}
if (c != delimiter) {
parts.back().push_back(c);
}
}
if (skipEmpty && !parts.empty() && parts.back().empty()) {
parts.pop_back();
}
return parts;
}
void test(const std::string& line) {
std::cout << line << std::endl;
std::cout << "skipEmpty=0 |";
for (const std::string& part : split(line, ':')) {
std::cout << part << '|';
}
std::cout << std::endl;
std::cout << "skipEmpty=1 |";
for (const std::string& part : split(line, ':', true)) {
std::cout << part << '|';
}
std::cout << std::endl;
std::cout << std::endl;
}
int main() {
test("foo:bar:::baz");
test("");
test("foo");
test(":");
test("::");
test(":foo");
test("::foo");
test(":foo:");
test(":foo::");
return 0;
}
Output:
foo:bar:::baz
skipEmpty=0 |foo|bar|||baz|
skipEmpty=1 |foo|bar|baz|
skipEmpty=0 |
skipEmpty=1 |
foo
skipEmpty=0 |foo|
skipEmpty=1 |foo|
:
skipEmpty=0 |||
skipEmpty=1 |
::
skipEmpty=0 ||||
skipEmpty=1 |
:foo
skipEmpty=0 ||foo|
skipEmpty=1 |foo|
::foo
skipEmpty=0 |||foo|
skipEmpty=1 |foo|
:foo:
skipEmpty=0 ||foo||
skipEmpty=1 |foo|
:foo::
skipEmpty=0 ||foo|||
skipEmpty=1 |foo|
Upvotes: 0
Reputation: 1429
Not that we need more answers, but this is what I came up with after being inspired by Evan Teran.
std::vector<std::string> split(const std::string &input, auto delimiter, bool skipEmpty=true) {
/*
Splits a string at each delimiter and returns these strings as a string vector.
If the delimiter is not found then nothing is returned.
If skipEmpty is true then strings between delimiters that are 0 in length will be skipped.
*/
bool delimiterFound = false;
std::size_t pos=0, pPos=0; // unsigned, so comparisons with std::string::npos and length() are exact
std::vector <std::string> result;
while (true) {
pos = input.find(delimiter,pPos);
if (pos != std::string::npos) {
if (skipEmpty==false or pos-pPos > 0) // if empty values are to be kept or not
result.push_back(input.substr(pPos,pos-pPos));
delimiterFound = true;
} else {
if (pPos < input.length() and delimiterFound) {
if (skipEmpty==false or input.length()-pPos > 0) // if empty values are to be kept or not
result.push_back(input.substr(pPos,input.length()-pPos));
}
break;
}
pPos = pos+1;
}
return result;
}
Upvotes: -1
Reputation: 1966
Yes, I looked through all 30 examples.
I couldn't find a version of split that works for multi-char delimiters, so here's mine:
#include <string>
#include <vector>
using namespace std;
vector<string> split(const string &str, const string &delim)
{
const auto delim_pos = str.find(delim);
if (delim_pos == string::npos)
return {str};
vector<string> ret{str.substr(0, delim_pos)};
auto tail = split(str.substr(delim_pos + delim.size(), string::npos), delim);
ret.insert(ret.end(), tail.begin(), tail.end());
return ret;
}
Probably not the most efficient of implementations, but it's a very straightforward recursive solution, using only <string> and <vector>.
Ah, it's written in C++11, but there's nothing special about this code, so you could easily adapt it to C++98.
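If the recursion and the repeated copying of the tail are a concern, the same splitting rule can be written as a loop. The following is a sketch with a hypothetical name, split_iterative. It returns the same pieces as the recursive version for a non-empty delimiter. For an empty delimiter the recursive version never terminates, because find() keeps returning position 0. Returning the whole string in that case is an arbitrary choice made here.

```cpp
#include <string>
#include <vector>

// Hypothetical iterative counterpart of the recursive split above.
std::vector<std::string> split_iterative(const std::string& str, const std::string& delim) {
    std::vector<std::string> parts;
    if (delim.empty()) {  // assumption: treat an empty delimiter as "no split"
        parts.push_back(str);
        return parts;
    }
    std::string::size_type start = 0;
    std::string::size_type pos;
    while ((pos = str.find(delim, start)) != std::string::npos) {
        parts.push_back(str.substr(start, pos - start));  // piece before the delimiter
        start = pos + delim.size();                       // resume after the delimiter
    }
    parts.push_back(str.substr(start));  // remainder, possibly empty
    return parts;
}
```

As in the recursive version, empty pieces are kept, so a trailing delimiter yields a final empty string.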
Upvotes: 2
Reputation: 141
My general implementation for string and u32string, using the boost::algorithm::split signature.
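#include <algorithm>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>
template<typename CharT, typename UnaryPredicate>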
void split(std::vector<std::basic_string<CharT>>& split_result,
const std::basic_string<CharT>& s,
UnaryPredicate predicate)
{
using ST = std::basic_string<CharT>;
using std::swap;
std::vector<ST> tmp_result;
auto iter = s.cbegin(),
end_iter = s.cend();
while (true)
{
/**
* edge case: empty str -> push an empty str and exit.
*/
auto find_iter = std::find_if(iter, end_iter, predicate);
tmp_result.emplace_back(iter, find_iter);
if (find_iter == end_iter) { break; }
iter = ++find_iter;
}
swap(tmp_result, split_result);
}
template<typename CharT>
void split(std::vector<std::basic_string<CharT>>& split_result,
const std::basic_string<CharT>& s,
const std::basic_string<CharT>& char_candidate)
{
std::unordered_set<CharT> candidate_set(char_candidate.cbegin(),
char_candidate.cend());
auto predicate = [&candidate_set](const CharT& c) {
return candidate_set.count(c) > 0U;
};
return split(split_result, s, predicate);
}
template<typename CharT>
void split(std::vector<std::basic_string<CharT>>& split_result,
const std::basic_string<CharT>& s,
const CharT* literals)
{
return split(split_result, s, std::basic_string<CharT>(literals));
}
Upvotes: 0
Reputation: 3116
Here is a split function that:
ignores empty tokens (can easily be changed)
template<typename T>
vector<T>
split(const T & str, const T & delimiters) {
vector<T> v;
typename T::size_type start = 0;
auto pos = str.find_first_of(delimiters, start);
while(pos != T::npos) {
if(pos != start) // ignore empty tokens
v.emplace_back(str, start, pos - start);
start = pos + 1;
pos = str.find_first_of(delimiters, start);
}
if(start < str.length()) // ignore trailing delimiter
v.emplace_back(str, start, str.length() - start); // add what's left of the string
return v;
}
Example usage:
vector<string> v = split<string>("Hello, there; World", ";,");
vector<wstring> v = split<wstring>(L"Hello, there; World", L";,");
Upvotes: 45