Reputation: 61
I want to tokenize a large amount of Burmese text, so I tried using Boost.Tokenizer.
The text I was testing with is ျခင္းခတ္ခဲ့တာလို႕
and it should get tokenized to ျခင္း
and င္းျခင္း
but the program just outputs the input unchanged. Am I doing something wrong?
#include <iostream>
#include <boost/tokenizer.hpp>
#include <string>
int main() {
    using namespace std;
    using namespace boost;
    string s = "ျခင္းခတ္ခဲ့တာလို႕";
    tokenizer<> tok(s);
    for (tokenizer<>::iterator beg = tok.begin(); beg != tok.end(); ++beg) {
        cout << *beg << "\n";
    }
}
The output should be a series of tokens such as ျခင္း
and ခတ္ခဲ့တာလို႕
but currently the output is identical to the input.
I want to split the text into tokens at word boundaries, if possible.
Upvotes: 1
Views: 215
Reputation: 392911
I don't understand that language, but detecting word boundaries is, in general, not tokenizing.
Instead, use Boost Locale's Boundary Analysis.
The sample:
using namespace boost::locale::boundary;
boost::locale::generator gen;
std::string text = "To be or not to be, that is the question.";
// Create a mapping of the text for the token iterator, using a generated en_US.UTF-8 locale.
ssegment_index map(word, text.begin(), text.end(), gen("en_US.UTF-8"));
// Print all "words" -- chunks between word boundaries
for (ssegment_index::iterator it = map.begin(), e = map.end(); it != e; ++it)
    std::cout << "\"" << *it << "\", ";
std::cout << std::endl;
Would print
"To", " ", "be", " ", "or", " ", "not", " ", "to", " ", "be", ",", " ", "that", " ", "is", " ", "the", " ", "question", ".",
And the sentence "生きるか死ぬか、それが問題だ。"
would be split into the following segments in the ja_JP.UTF-8 (Japanese) locale:
"生", "きるか", "死", "ぬか", "、", "それが", "問題", "だ", "。",
A demo using the OP's text and the my_MM locale:
#include <boost/range/iterator_range.hpp>
#include <boost/locale.hpp>
#include <boost/locale/boundary.hpp>
#include <iostream>
#include <iomanip>
int main() {
    using namespace boost::locale::boundary;
    boost::locale::generator gen;
    std::string text = "ျခင္းခတ္ခဲ့တာလို႕";
    ssegment_index map(word, text.begin(), text.end(), gen("my_MM.UTF-8"));
    for (auto&& segment : boost::make_iterator_range(map.begin(), map.end()))
        std::cout << std::quoted(segment.str()) << std::endl;
}
Prints
"ျ"
"ခ"
"င္း"
"ခ"
"တ္"
"ခဲ့"
"တာ"
"လို႕"
This may or may not be what the OP expects. Note that you might have to generate/install the appropriate locale(s) on your system for it to work as expected.
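As for why the original `tokenizer<>` attempt echoed the input: as far as I know, the default tokenizer function splits on whitespace and punctuation characters, and the Burmese string contains neither. Below is a minimal standard-library sketch of that kind of delimiter-based splitting. The helper name `split_on_ascii_delims` is hypothetical and not part of Boost, and it is a simplification: it restricts itself to ASCII delimiters and drops them, which is not exactly what Boost does.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper (not a Boost API): splits a string on ASCII whitespace
// and ASCII punctuation, dropping the delimiters. Bytes >= 0x80, i.e. the
// bytes of multi-byte UTF-8 sequences, are never treated as delimiters.
std::vector<std::string> split_on_ascii_delims(const std::string& s) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : s) {
        const bool is_delim = c < 0x80 && (std::isspace(c) || std::ispunct(c));
        if (is_delim) {
            // A delimiter ends the current token, if there is one.
            if (!current.empty()) {
                tokens.push_back(current);
                current.clear();
            }
        } else {
            current += static_cast<char>(c);
        }
    }
    // Flush the trailing token.
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}
```

Calling `split_on_ascii_delims("to be, or not")` yields four tokens, while the Burmese string from the question comes back as a single token, because it contains no delimiter for the splitter to act on. Finding words inside such text needs language-aware boundary analysis rather than delimiter splitting.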
Upvotes: 1