Reputation: 125
I have strings of the format:
7XXXX 8YYYY 9ZZZZ 0LLLL 7XXXX 8YYYY 9ZZZZ 0LLLL
The 7XXXX 8YYYY 9ZZZZ 0LLLL groups can repeat any number of times, and a group need not contain all four parts, for example:
7XXXX 0LLLL 8YYYY 0LLLL 7XXXX 8YYYY 9ZZZZ 0LLLL
I am trying to accomplish this using the Boost.Regex library.
I want to split these groups and get them into an array or vector. For now I am just printing them with cout.
This is my attempt, but I can only get either the full string match or the last match for each of the 7, 8, 9, 0 groups, not substrings like 7XXXX 8YYYY 9ZZZZ 0LLLL:
const char* pat = "(([[:space:]]+7[0-9]{4}){0,1}([[:space:]]+8[0-9]{4}){0,1}([[:space:]]+9[0-9]{4}){0,1}([[:space:]]+0[0-9]{4}){0,1})+";
boost::regex reg(pat);
boost::smatch match;
string example= "71122 85451 75415 01102 75555 82133 91341 02134";
const int subgroups[] = {0,1,2,3,4,5,6};
boost::sregex_token_iterator i(example.begin(), example.end(), reg, subgroups);
boost::sregex_token_iterator j;
while (i != j)
{
    cout << "Match: " << *i++ << endl;
}
Sample output:
Match: 71122 85451 75415 01102 75555 82133 91341 02134
<A bunch of empty "Match:" rows>
Match: 75555
Match: 82133
Match: 91341
Match: 02134
<A bunch of empty "Match:" rows>
But I want to get it like this:
71122 85451
75415 01102
75555 82133 91341 02134
I know I am doing it wrong, but I can't come up with a regex that does what I want :( Why can't I get all the recursive matches using parentheses?
Upvotes: 1
Views: 1243
Reputation: 393114
I think I'd hand roll a parser here. In the interest of agility, how about parsing with Spirit?
It expresses intent quite clearly: a sequence is any combination of items in the expected order, as long as the result has at least one item:
seq_ = -item_('7') >> -item_('8') >> -item_('9') >> -item_('0');
where item_ parses any integer that starts with the indicated digit:
item_ = &char_(_r1) >> uint_;
In the parser we parse any number of sequences with *seq_, which is why we added a check that each matched sequence is not empty (otherwise we could get an infinite loop matching empty sequences at the same input location):
eps(phx::size(_val) > 0) // require 1 element at least
Note how debugging is built in (enable it by uncommenting the first line).
Note how it would be trivial to exclude the leading digits from the result by omitting the lead character (see the alternative version on Coliru):
item_ = omit[char_(_r1)] >> uint_;
Test program output:
Parsing: 71122 85451 75415 01102 75555 82133 91341 02134
Parsed: 3 sequences
seq: 71122 85451
seq: 75415 1102
seq: 75555 82133 91341 2134
//#define BOOST_SPIRIT_DEBUG
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

namespace qi = boost::spirit::qi;
namespace phx = boost::phoenix;

using data = std::vector<std::vector<unsigned> >;

template <typename It, typename Skipper = qi::space_type>
struct grammar : qi::grammar<It, data(), Skipper> {
    grammar() : grammar::base_type(start) {
        using namespace qi;

        // any number of sequences
        start = *seq_;
        // items in the fixed order 7, 8, 9, 0; each one is optional
        seq_ = -item_('7') >> -item_('8') >> -item_('9') >> -item_('0')
            >> eps(phx::size(_val) > 0) // require 1 element at least
            ;
        // an unsigned integer whose first character is the inherited attribute
        item_ = &char_(_r1) >> uint_;

        BOOST_SPIRIT_DEBUG_NODES((start)(item_)(seq_))
    }
  private:
    qi::rule<It, unsigned(char), Skipper> item_;
    qi::rule<It, std::vector<unsigned>(), Skipper> seq_;
    qi::rule<It, data(), Skipper> start;
};

int main() {
    for (std::string const input : {
            "71122 85451 75415 01102 75555 82133 91341 02134"
        })
    {
        using It = std::string::const_iterator;
        grammar<It> p;
        auto f(input.begin()), l(input.end());

        data parsed;
        bool ok = qi::phrase_parse(f,l,p,qi::space,parsed);

        std::cout << "Parsing: " << input << "\n";
        if (ok) {
            std::cout << "Parsed: " << parsed.size() << " sequences\n";
            for(auto& seq : parsed)
                std::copy(seq.begin(), seq.end(), std::ostream_iterator<unsigned>(std::cout << "\nseq:\t", " "));
            std::cout << "\n";
        } else {
            std::cout << "Parse failed\n";
        }

        if (f!=l)
            std::cout << "Remaining unparsed input: '" << std::string(f,l) << "'\n";
    }
}
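For comparison, the "hand rolled" route needs nothing beyond the standard library. The following is only a rough sketch of the same rule (items in the order 7, 8, 9, 0, at least one item per sequence), not a translation of the Spirit grammar: the names Sequence and split_sequences are made up for this illustration, the input is assumed to be whitespace-separated tokens, and only the leading character of each token is checked. Tokens are kept as strings, so 01102 keeps its leading zero, unlike the uint_ version.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical names, used only for this sketch.
using Sequence = std::vector<std::string>;

// Splits whitespace-separated tokens into sequences whose leading characters
// appear in the order 7, 8, 9, 0, with each of them used at most once.
std::vector<Sequence> split_sequences(const std::string& input) {
    static const std::string order = "7890";
    std::vector<Sequence> result;
    std::istringstream in(input);
    std::string token;
    std::string::size_type last_rank = 0; // rank of the previous token in `order`

    while (in >> token) {
        const std::string::size_type rank = order.find(token[0]);
        if (rank == std::string::npos)
            break; // unexpected leading character: stop, much like a failed parse

        // A token that does not come later in the order than the previous one
        // cannot belong to the current sequence, so it opens a new one.
        if (result.empty() || rank <= last_rank)
            result.emplace_back();

        result.back().push_back(token);
        last_rank = rank;
    }
    return result;
}
```

Called on the sample input, split_sequences returns three sequences: {71122, 85451}, {75415, 01102} and {75555, 82133, 91341, 02134}. Because a sequence is only created when a token is about to be pushed into it, an empty sequence can never occur, which is the job the eps check does in the grammar.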
Upvotes: 1
Reputation: 44043
EDIT: Since I completely misunderstood the first time around, I'll just replace the whole answer. I'm thinking along these lines:
const char* pat = "[[:space:]]+((7[0-9]{4})?([[:space:]]+8[0-9]{4})?([[:space:]]+9[0-9]{4})?([[:space:]]+0[0-9]{4})?)";
boost::regex reg(pat);
boost::smatch match;
// v-- extra space here to make the match easier.
std::string example= " 71122 85451 75415 01102 75555 82133 91341 02134";
boost::sregex_token_iterator i(example.begin(), example.end(), reg, 1);
boost::sregex_token_iterator j;
while (i != j)
{
    std::cout << "Match: " << *i++ << std::endl;
}
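If the string cannot be modified, the leading whitespace has to go from the pattern, which then also matches the empty string. One way to work around the resulting empty matches is to skip them: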
const char* pat = "((7[0-9]{4})?([[:space:]]+8[0-9]{4})?([[:space:]]+9[0-9]{4})?([[:space:]]+0[0-9]{4})?)";
boost::regex reg(pat);
boost::smatch match;
std::string example= "71122 85451 75415 01102 75555 82133 91341 02134";
boost::sregex_token_iterator i(example.begin(), example.end(), reg, 1);
boost::sregex_token_iterator j;
while (i != j)
{
    if(i->length() != 0) {
        std::cout << "Match: " << *i << std::endl;
    }
    ++i;
}
Although in that case it'd arguably be nicer to use regex_iterator instead of regex_token_iterator:
// No need for the outer parentheses anymore
const char* pat = "(7[0-9]{4})?([[:space:]]+8[0-9]{4})?([[:space:]]+9[0-9]{4})?([[:space:]]+0[0-9]{4})?";
boost::regex reg(pat);
boost::sregex_iterator i(example.begin(), example.end(), reg);
boost::sregex_iterator j;
// The rest is the same, except that *i is now a match_results: print i->str() instead of *i.
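Since the goal was to get the groups into a vector rather than print them, here is a minimal sketch of the regex_iterator variant that collects the non-empty matches. Note the swap: it uses std::regex from the standard library instead of Boost.Regex so that it compiles without Boost, and split_groups is a made-up helper name. The pattern is the one above, unchanged.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every non-empty match of the optional-group pattern.
std::vector<std::string> split_groups(const std::string& text) {
    // Every part of the pattern is optional, so it also matches the empty string.
    static const std::regex reg(
        "(7[0-9]{4})?([[:space:]]+8[0-9]{4})?"
        "([[:space:]]+9[0-9]{4})?([[:space:]]+0[0-9]{4})?");

    std::vector<std::string> groups;
    for (std::sregex_iterator i(text.begin(), text.end(), reg), j; i != j; ++i) {
        if (i->length() != 0)            // skip the empty matches between groups
            groups.push_back(i->str());  // whole match, i.e. sub-expression 0
    }
    return groups;
}
```

For the example string this gives the three groups "71122 85451", "75415 01102" and "75555 82133 91341 02134". One caveat of the pattern itself: a group that starts with 8, 9 or 0 instead of 7 is matched together with the whitespace in front of it, so such entries would need trimming.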
Upvotes: 1