Reputation: 5019
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
#define BOOST_SPIRIT_UNICODE // We'll use unicode (UTF8) all throughout
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/qi_parse.hpp>
#include <boost/spirit/include/support_standard_wide.hpp>
void parse_simple_string()
{
    namespace qi = boost::spirit::qi;
    namespace encoding = boost::spirit::unicode;
    //namespace stw = boost::spirit::standard_wide;
    typedef std::wstring::const_iterator iterator_type;
    std::vector<std::wstring> result;
    std::wstring const input = LR"(12,3","ab,cd","G,G\"GG","kkk","10,\"0","99987","PPP","你好)";
    qi::rule<iterator_type, std::wstring()> key = +(qi::unicode::char_ - qi::lit(L"\",\""));
    qi::phrase_parse(input.begin(), input.end(),
                     key % qi::lit(L"\",\""),
                     encoding::space,
                     result);
    //std::copy(result.rbegin(), result.rend(), std::ostream_iterator<std::wstring, wchar_t> (std::wcout, L"\n"));
    for(auto const &data : result) std::wcout<<data<<std::endl;
}
I studied the post "How to use Boost Spirit to parse Chinese(unicode utf-16)?" and followed its guidance, but the parser still fails to parse the words "你好".
The expected result is
12,3 ab,cd G,G\"GG kkk 10,\"0 99987 PPP 你好
but the actual result is 12,3 ab,cd G,G\"GG kkk 10,\"0 99987 PPP
The OS is Windows 7 64-bit, and my editor saves the source file as UTF-8.
Upvotes: 4
Views: 5101
Reputation: 1537
Although Evgeny Panasyuk's answer is correct, using u8_to_u32_iterator can be unsafe: if the input is not NUL-terminated, the iterator may read past the end of the buffer. Consider the following example:
File foobar.cpp
#include "boost/regex/pending/unicode_iterator.hpp"
#include <iostream>
int main() {
    const char contents[] = {'H', 'e', 'l', 'l', 'o', '\xF1'};
    using utf8_iter = boost::u8_to_u32_iterator<const char *>;
    auto iter = utf8_iter{contents};
    auto end = utf8_iter{contents + sizeof(contents)};
    for (; iter != end; ++iter)
        std::cout << *iter << '\n';
}
When this is compiled with the command clang++ -g -fsanitize=address -std=c++17 -I path/to/boost/ -o foobar foobar.cpp
and then run, Clang's AddressSanitizer reports a stack-buffer-overflow error. The last byte in the buffer is the leading byte of a 4-byte UTF-8 sequence, so the iterator keeps reading bytes past the end of the buffer.
If the last byte is NUL, as in const char contents[] = "Hello\xF1";, the iterator detects an encoding error when it reads the NUL character and stops reading. The result is an uncaught exception instead of undefined behavior.
In short, make sure the input is NUL-terminated before using boost::u8_to_u32_iterator, or you risk undefined behavior.
Upvotes: 2
Reputation: 9199
If your input is UTF-8, you can use the Unicode iterators from Boost.Regex.
For instance, use boost::u8_to_u32_iterator:
A Bidirectional iterator adapter that makes an underlying sequence of UTF8 characters look like a (read-only) sequence of UTF32 characters.
#include <boost/regex/pending/unicode_iterator.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/range.hpp>
#include <iterator>
#include <iostream>
#include <ostream>
#include <cstdint>
#include <vector>
int main()
{
    using namespace boost;
    using namespace spirit::qi;
    using namespace std;
    auto &&utf8_text=u8"你好,世界!";
    u8_to_u32_iterator<const char*>
        tbegin(begin(utf8_text)), tend(end(utf8_text));
    vector<uint32_t> result;
    parse(tbegin, tend, *standard_wide::char_, result);
    for(auto &&code_point : result)
        cout << "&#" << code_point << ";";
    cout << endl;
}
Output is:
&#20320;&#22909;&#44;&#19990;&#30028;&#33;&#0;
Upvotes: 10