Reputation: 601
I am trying to tokenize a string such that
"word1 word2 word3 word4"
will be tokenized into 4 strings: "word1", "word2", "word3" and "word4", while
"word1 \"word2 word3\" word4"
will be tokenized into 3 strings: "word1", "word2 word3" and "word4".
I have written a function tokenizeQuoted() which does the job. After reading each token, the function checks whether an error occurred by testing the failbit of the stream.
#include <cstring>
#include <cwchar>
#include <iomanip>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>
// Tokenize a string using std::quoted
template <typename CharType>
std::vector<std::basic_string<CharType>> tokenizeQuoted(const std::basic_string<CharType> &input)
{
    std::basic_istringstream<CharType> iss(input);
    std::vector<std::basic_string<CharType>> tokens;
    std::basic_string<CharType> token;
    while (!iss.eof())
    {
        iss >> std::quoted(token);
        if (iss.fail())
        {
            throw std::runtime_error("failed to tokenize string: '" + input + "'; bad bit = " + (iss.bad() ? "true" : "false"));
        }
        tokens.push_back(token);
    }
    return tokens;
}

int main() {
    const std::string inputMars = "\"hello mars\"!"; // note the '!' at the end
    const std::string inputEarth = "\"hello earth\"";
    const auto mars = tokenizeQuoted(inputMars);   // OK
    const auto earth = tokenizeQuoted(inputEarth); // failbit is set
    return mars.size() + earth.size();
}
In general the function works. But when the input string ends with a quoted string (like "say \"good day\""), the failbit is set, which I would not expect. What can I do to reliably detect errors and still be able to extract quoted strings at the end of the sequence?
Upvotes: 2
Views: 88
Reputation: 1432
This doesn't give a definite answer as to why it is happening, but I did find a workaround. Basically, I observed that even though std::quoted moves the position indicator to the correct place, it may not perform an eof check and set the bit accordingly. The workaround is to check whether the eof bit is set and, if it is not, to call peek(). Peeking the next value does nothing if we are not at the end of the stream, but if we are, it correctly updates the bits. I demonstrate that this works by printing the extracted token, as well as the stream bits before and after calling peek().
#include <cstring>
#include <cwchar>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>
// Tokenize a string using std::quoted
template <typename CharType>
std::vector<std::basic_string<CharType>> tokenizeQuoted(const std::basic_string<CharType> &input)
{
    std::basic_istringstream<CharType> iss(input);
    std::vector<std::basic_string<CharType>> tokens;
    std::basic_string<CharType> token;
    while (iss >> std::ws && !iss.eof())
    {
        token.clear();
        iss >> std::quoted(token);
        if (iss.fail())
        {
            throw std::runtime_error("failed to tokenize string: '" + input + "'; bad bit = " + (iss.bad() ? "true" : "false"));
        }
        tokens.push_back(token);
        std::cerr << "Token : " << token << std::endl;
        std::cerr << "Before: " << iss.good() << iss.eof() << iss.fail() << iss.bad() << std::endl;
        if (!iss.eof()) {
            iss.peek();
        }
        std::cerr << "After : " << iss.good() << iss.eof() << iss.fail() << iss.bad() << std::endl << std::endl;
    }
    return tokens;
}
int main()
{
    std::vector<std::string> inputs {
        R"("hello mars"!)",
        R"("hello earth")",
        R"(no quotes)",
        R"("unfinished quotes)",
        R"(")",
        R"("")",
        R"(""")",
        R"("""")",
        R"( "leading whitespace")",
        R"("trailing whitespace" )",
    };
    for (const auto& input : inputs)
    {
        std::cout << "Tokenizing '" << input << "'\n";
        try {
            auto tokens = tokenizeQuoted(input);
            for (const auto& token : tokens)
            {
                std::cout << " - '" << token << "'\n";
            }
        } catch (std::runtime_error& e) {
            std::cout << e.what() << "\n";
        }
    }
}
Note how the eof bit updates to the correct state after calling peek()
in the "hello earth" extraction.
Upvotes: 2
Reputation: 63481
The rules are documented:
b) Otherwise (if the first character is the delimiter):
- Turns off the skipws flag on the input stream.
- Empties the destination string by calling s.clear().
- Extracts characters from in and appends them to s, except that whenever an escape character is extracted, it is ignored and the next character is appended to s. Extraction stops when !in == true or when an unescaped delim character is found.
- Discards the final (unescaped) delim character.
- Restores the skipws flag on the input stream to its original value.
What this means is that you cannot reliably determine whether the quoted string was properly formed by purely using the I/O manipulator. It is too simple.
That being said, if you add a bit of extra debugging information where you throw the exception, you'll see that the failure does not occur while processing the last token, but afterwards, when you try (and fail) to extract the next one:
while (!iss.eof())
{
    token.clear();
    iss >> std::quoted(token);
    if (iss.fail())
    {
        throw std::runtime_error(
            "failed to tokenize string: '" + input + "'"
            "; token: '" + token + "'"
            "; extracted: " + std::to_string(tokens.size()) +
            "; bad bit = " + (iss.bad() ? "true" : "false"));
    }
    tokens.push_back(token);
}
This throws:
failed to tokenize string: '"hello earth"'; token: ''; extracted: 1; bad bit = false
You can see that you already processed one token, and then failed to process the next one. That's because the eof bit was not set on the stream after the first extraction: the closing double-quote ended extraction before end-of-input was reached, so the loop ran once more on an exhausted stream.
It quickly gets messy and unintuitive to rely on the various stream error bits for parsing. You probably want to handle specific error scenarios, such as unterminated quoted strings, while still accepting empty strings.
A more intuitive approach might be to tokenize the string using a regular expression. You can even use a std::regex_iterator and then parse each detected string with std::quoted, since by that stage you'll know it's properly formed.
Upvotes: 1