Reputation: 9411

Parse a particular string using Boost Spirit Qi

I am new to Boost Spirit and am struggling to create a proper expression to parse the following input (actually the stdout of some command):

^+ line-17532.dyn.kponet.fi      2   7   377     1   +1503us[+9103us] +/-   55ms

I need to parse this into a set of strings and integers and record them in variables. Most of the line should simply be parsed into a variable of the appropriate type (string or int). So in the end, I get:

string:  "^+", "line-17532.dyn.kponet.fi", "+1503us", "+9103us", "55ms"
int   :   2, 7, 377, 1

The pair

+1503us[+9103us]

can also be with space

+503us[ +103us]

and I need the part before the square brackets and the part inside the square brackets to be placed in separate strings.

Additionally, time designations can be expressed as

ns, ms, us, s

I appreciate examples about how to deal with it, because the available documentation is quite sparse and not cohesive.

Large piece of the log, along with headings describing the individual fields:

MS Name/IP address         Stratum Poll Reach LastRx Last sample               
===============================================================================
^+ ns2.sdi.fi                    2   9   377   381  -1476us[-1688us] +/-   72ms
^+ line-17532.dyn.kponet.fi      2  10   377   309   +302us[ +302us] +/-   59ms
^* heh.fi                        2  10   377   319  -1171us[-1387us] +/-   50ms
^+ stara.mulimuli.fi             3  10   377   705  -1253us[-1446us] +/-   73ms

Upvotes: 4

Answers (3)

Jerry Coffin

Reputation: 490178

This is one of those times I can almost feel some sympathy for people who claim that C++ has just added complexity, and C was really better. It does lose some things like type safety, but consider what reading this looks like with C's scanf:

struct record {
    char prefix[256];
    char url[256];
    int a, b, c, d;
    char time1[256];
    char time2[256];
    char time3[256];
};

sscanf(input, 
       "%255s %255s %d %d %d %d %255[^[][ %255[^]]] +/- %255s",
       r.prefix, r.url, &r.a, &r.b, &r.c, &r.d, r.time1, r.time2, r.time3);

This does, of course, have a few potential liabilities:

It reads into arrays of char instead of std::strings.
scanf and cousins aren't type safe.
It doesn't try to verify the suffixes on the times.
A Spirit-based parser may easily be at least somewhat faster.

If any of these is really a serious problem for your purposes, you might really need a different approach. Given what it looks like the code is probably intended to do, it's not immediately obvious that any of them is likely to cause a real problem though.

Upvotes: 2

Dan Mašek

Reputation: 19041

Note: This answer shows a simpler approach, forming a foundation for additional techniques shown by sehe.

Preamble

Let's enable Spirit debug output, so we can follow the progress of our parses while we're developing them.

#define BOOST_SPIRIT_DEBUG 1

#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/include/adapt_struct.hpp>

#include <cstdint>   // uint32_t
#include <iostream>  // std::cout
#include <string>

namespace qi = boost::spirit::qi;

Log Entry Data Structure

The first step is to define a structure to hold our parsed log entries.

struct log_entry_t
{
    std::string element_0;
    std::string element_1;
    uint32_t element_2;
    uint32_t element_3;
    uint32_t element_4;
    uint32_t element_5;
    std::string element_6;
    std::string element_7;
    std::string element_8;
};

Adapting the Data Structure

In order to use the structure as an attribute of a Spirit grammar, we need to adapt it into a fusion tuple (more info is in one of the Spirit tutorials). This is achieved using BOOST_FUSION_ADAPT_STRUCT.

BOOST_FUSION_ADAPT_STRUCT(
    log_entry_t
    , (std::string, element_0)
    , (std::string, element_1)
    , (uint32_t, element_2)
    , (uint32_t, element_3)
    , (uint32_t, element_4)
    , (uint32_t, element_5)
    , (std::string, element_6)
    , (std::string, element_7)
    , (std::string, element_8)
)

Log Line Grammar

Next, we define the grammar for the log entry. Since the individual elements may be separated by whitespace, we want to use phrase parsing, and thus need to specify a skip parser. qi::blank_type is an appropriate skipper, since it matches spaces and tabs only (not line breaks).

However, since all of the elements should be treated as lexemes, we do not specify any skipper for their rules.

template <typename Iterator>
struct log_line_parser
    : qi::grammar<Iterator, log_entry_t(), qi::blank_type>
{
    typedef qi::blank_type skipper_t;

    log_line_parser()
        : log_line_parser::base_type(log_line)
    {
        element_0 %= qi::string("^+");
        element_1 %= qi::raw[(+qi::char_("-a-zA-Z0-9") % qi::char_('.'))];
        element_2 %= qi::uint_;
        element_3 %= qi::uint_;
        element_4 %= qi::uint_;
        element_5 %= qi::uint_;
        element_6 %= qi::raw[qi::char_('+') >> qi::uint_ >> time_unit];
        element_7 %= qi::raw[qi::char_('+') >> qi::uint_ >> time_unit];
        element_8 %= qi::raw[qi::uint_ >> time_unit];

        time_unit %= -qi::char_("nmu") >> qi::char_('s');

        log_line
            %=  element_0
            >>  element_1
            >>  element_2
            >>  element_3
            >>  element_4
            >>  element_5
            >>  element_6
            >>  qi::lit('[') >> element_7 >> qi::lit(']')
            >>  qi::lit("+/-")
            >>  element_8
            ;

        init_debug();
    }

    void init_debug()
    {
        BOOST_SPIRIT_DEBUG_NODE(element_0);
        BOOST_SPIRIT_DEBUG_NODE(element_1);
        BOOST_SPIRIT_DEBUG_NODE(element_2);
        BOOST_SPIRIT_DEBUG_NODE(element_3);
        BOOST_SPIRIT_DEBUG_NODE(element_4);
        BOOST_SPIRIT_DEBUG_NODE(element_5);
        BOOST_SPIRIT_DEBUG_NODE(element_6);
        BOOST_SPIRIT_DEBUG_NODE(element_7);
        BOOST_SPIRIT_DEBUG_NODE(element_8);

        BOOST_SPIRIT_DEBUG_NODE(time_unit);

        BOOST_SPIRIT_DEBUG_NODE(log_line);
    }

private:
    qi::rule<Iterator, std::string()> element_0;
    qi::rule<Iterator, std::string()> element_1;
    qi::rule<Iterator, uint32_t()> element_2;
    qi::rule<Iterator, uint32_t()> element_3;
    qi::rule<Iterator, uint32_t()> element_4;
    qi::rule<Iterator, uint32_t()> element_5;
    qi::rule<Iterator, std::string()> element_6;
    qi::rule<Iterator, std::string()> element_7;
    qi::rule<Iterator, std::string()> element_8;

    qi::rule<Iterator, std::string()> time_unit;

    qi::rule<Iterator, log_entry_t(), skipper_t> log_line;
};

Let's go through some of the rules:

Element 0 - This is a simple string we need to match. Since we wish to capture it as well, we need to use the string parser.
Element 1 - We can use the char_ parser to match either a single character or a character set. The + parser operator represents repetition, and the % (list) parser operator lets us parse several repetitions separated by a separator (in our case a dot). The separators are not part of the list's attribute, so the expression is wrapped in raw to capture the name together with its dots.
Element 2 - To parse numbers, we can use the existing numeric parsers.
Element 6 - Since we want to capture the whole sequence in a string, we use the raw parser directive.

In order to determine the resulting attribute type when using parser operators, refer to the reference of compound attribute rules.

Test Function

bool test(std::string const& log)
{
    std::cout << "Parsing: " << log << "\n\n";

    std::string::const_iterator iter(log.begin());
    std::string::const_iterator end(log.end());

    log_line_parser<std::string::const_iterator> g;

    log_entry_t entry;

    bool r(qi::phrase_parse(iter, end, g, qi::blank, entry));

    std::cout << "-------------------------\n";

    if (r && (iter == end)) {
        std::cout << "Parsing succeeded\n";
        std::cout << entry.element_0 << "\n"
            << entry.element_1 << "\n"
            << entry.element_2 << "\n"
            << entry.element_3 << "\n"
            << entry.element_4 << "\n"
            << entry.element_5 << "\n"
            << entry.element_6 << "\n"
            << entry.element_7 << "\n"
            << entry.element_8 << "\n";
    } else {
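        // Show at most 30 characters of the unparsed remainder;
        // never move the iterator past the end of the string.
        std::string::const_iterator some = (end - iter > 30) ? iter + 30 : end;
        std::string context(iter, some);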
        std::cout << "Parsing failed\n";
        std::cout << "stopped at: \": " << context << "...\"\n";
    }

    return r;
}

Main Function

Finally, let's run a few positive and negative tests on our parser.

int main()
{
    bool result(true);
    result &= test("^+ line-17532.dyn.kponet.fi      2   7   377     1   +1503us[+9103us] +/-   55ms");
    result &= test("^+ line-17532.dyn.kponet.fi      2   7   377     1   +1503us[ +9103us] +/-   55ms");
    result &= test("^+ line-17532.dyn.kponet.fi      2   7   377     1   +1503ms[+9103ns] +/-   55s");

    result &= !test("^- line-17532.dyn.kponet.fi      2   7   377     1   +1503us[+9103us] +/-   55ms");
    result &= !test("^+ line-17532.dyn.kponet.fi      2   7   377     1   +1503us[+9103us] +/-   55 ms");
    result &= !test("^+ line-17532.dyn.kponet.fi      2   7   377     1   + 1503us[+9103us] +/-   55ms");
    result &= !test("^+ line-17532.dyn.kponet.fi      2   7   +377     1   +1503us[+9103us] +/-   55ms");
    result &= !test("^+ line-17532.dyn.kponet.fi      2   7   3 77     1   +1503us[+9103us] +/-   55ms");
    result &= !test("^+ line-17532.dyn.kponet.fi      2   7   -377     1   +1503us[+9103us] +/-   55ms");


    std::cout << "Test result = " << result << "\n";

    return 0;
}

After a lot of debugging output (example for the first test):

Parsing: ^+ line-17532.dyn.kponet.fi      2   7   377     1   +1503us[+9103us] +/-   55ms

<log_line>
  <try>^+ line-17532.dyn.kp</try>
  <element_0>
    <try>^+ line-17532.dyn.kp</try>
    <success> line-17532.dyn.kpon</success>
    <attributes>[[^, +]]</attributes>
  </element_0>
  <element_1>
    <try>line-17532.dyn.kpone</try>
    <success>      2   7   377   </success>
    <attributes>[[l, i, n, e, -, 1, 7, 5, 3, 2, ., d, y, n, ., k, p, o, n, e, t, ., f, i]]</attributes>
  </element_1>
  <element_2>
    <try>2   7   377     1   </try>
    <success>   7   377     1   +</success>
    <attributes>[2]</attributes>
  </element_2>
  <element_3>
    <try>7   377     1   +150</try>
    <success>   377     1   +1503</success>
    <attributes>[7]</attributes>
  </element_3>
  <element_4>
    <try>377     1   +1503us[</try>
    <success>     1   +1503us[+91</success>
    <attributes>[377]</attributes>
  </element_4>
  <element_5>
    <try>1   +1503us[+9103us]</try>
    <success>   +1503us[+9103us] </success>
    <attributes>[1]</attributes>
  </element_5>
  <element_6>
    <try>+1503us[+9103us] +/-</try>
    <time_unit>
      <try>us[+9103us] +/-   55</try>
      <success>[+9103us] +/-   55ms</success>
      <attributes>[[u, s]]</attributes>
    </time_unit>
    <success>[+9103us] +/-   55ms</success>
    <attributes>[[+, 1, 5, 0, 3, u, s]]</attributes>
  </element_6>
  <element_7>
    <try>+9103us] +/-   55ms</try>
    <time_unit>
      <try>us] +/-   55ms</try>
      <success>] +/-   55ms</success>
      <attributes>[[u, s]]</attributes>
    </time_unit>
    <success>] +/-   55ms</success>
    <attributes>[[+, 9, 1, 0, 3, u, s]]</attributes>
  </element_7>
  <element_8>
    <try>55ms</try>
    <time_unit>
      <try>ms</try>
      <success></success>
      <attributes>[[m, s]]</attributes>
    </time_unit>
    <success></success>
    <attributes>[[5, 5, m, s]]</attributes>
  </element_8>
  <success></success>
  <attributes>[[[^, +], [l, i, n, e, -, 1, 7, 5, 3, 2, ., d, y, n, ., k, p, o, n, e, t, ., f, i], 2, 7, 377, 1, [+, 1, 5, 0, 3, u, s], [+, 9, 1, 0, 3, u, s], [5, 5, m, s]]]</attributes>
</log_line>
-------------------------
Parsing succeeded
^+
line-17532.dyn.kponet.fi
2
7
377
1
+1503us
+9103us
55ms

the program prints the following line:

Test result = 1

Live sample on Coliru

Upvotes: 3

sehe

Reputation: 393134

As always, I start with sketching a useful AST:

namespace AST {
    using clock = std::chrono::high_resolution_clock;

    struct TimeSample {
        enum Direction { up, down } direction; // + or -
        clock::duration value;
    };

    struct Record {
        std::string prefix; // "^+"
        std::string fqdn;   // "line-17532.dyn.kponet.fi"
        int a, b, c, d;     // 2, 7, 377, 1
        TimeSample primary, braced;
        clock::duration tolerance;
    };
}

Now that we know what we want to parse, we mostly just mimic the AST with rules, for a bit:

using namespace qi;

start     = skip(blank) [record_];

record_   = prefix_ >> fqdn_ >> int_ >> int_ >> int_ >> int_ >> sample_ >> '[' >> sample_ >> ']' >> tolerance_;

prefix_   = string("^+"); // or whatever you need to match here
fqdn_     = +graph; // or whatever additional constraints you have
sample_   = direction_ >> duration_;
duration_ = (long_ >> units_) [ _val = _1 * _2 ];
tolerance_= "+/-" >> duration_;

Of course, the interesting bits are the units and the direction:

struct directions : qi::symbols<char, AST::TimeSample::Direction> {
    directions() { add("+", AST::TimeSample::up)("-", AST::TimeSample::down); }
} direction_;
struct units : qi::symbols<char, AST::clock::duration> {
    units() {
        using namespace std::literals::chrono_literals;
        add("s", 1s)("ms", 1ms)("us", 1us)("µs", 1us)("ns", 1ns);
    }
} units_;

The white-space acceptance is governed by a skipper; I chose qi::blank_type for the non-lexeme rules:

using Skipper = qi::blank_type;
qi::rule<It, AST::Record()> start;
qi::rule<It, AST::Record(), Skipper> record_;
qi::rule<It, AST::TimeSample(), Skipper> sample_;
qi::rule<It, AST::clock::duration(), Skipper> duration_, tolerance_;
// lexemes:
qi::rule<It, std::string()> prefix_;
qi::rule<It, std::string()> fqdn_;

DEMO

Putting it all together and using it:

int main() {
    std::istringstream iss(R"(^+ line-17532.dyn.kponet.fi      2   7   377     1   +1503us[+9103us] +/-   55ms
)");

    std::string line;

    while (getline(iss, line)) {
        auto f = line.cbegin(), l = line.cend();
        AST::Record record;
        if (parse(f, l, parser<>{}, record))
            std::cout << "parsed: " << boost::fusion::as_vector(record) << "\n";
        else
            std::cout << "parse error\n";

        if (f!=l)
            std::cout << "remaining unparsed input: '" << std::string(f,l) << "'\n";
    }
}

Which prints:

parsed: (^+ line-17532.dyn.kponet.fi 2 7 377 1 +0.001503s +0.009103s 0.055s)

(debug output below)

Full Code:

#define BOOST_SPIRIT_DEBUG
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/fusion/adapted.hpp>
#include <iostream>
#include <sstream>
#include <chrono>

namespace std { namespace chrono {
    // for debug
    std::ostream& operator<<(std::ostream& os, duration<double> d) { return os << d.count() << "s"; }
} }

namespace AST {
    using clock = std::chrono::high_resolution_clock;

    struct TimeSample {
        enum Direction { up, down } direction; // + or -
        clock::duration value;

        // for debug:
        friend std::ostream& operator<<(std::ostream& os, Direction d) {
            char const* signs[] = {"+","-"};
            return os << signs[d];
        }
        friend std::ostream& operator<<(std::ostream& os, TimeSample const& sample) {
            return os << sample.direction << std::chrono::duration<double>(sample.value).count() << "s";
        }
    };

    struct Record {
        std::string prefix; // "^+"
        std::string fqdn;   // "line-17532.dyn.kponet.fi"
        int a, b, c, d;     // 2, 7, 377, 1
        TimeSample primary, braced;
        clock::duration tolerance;
    };
}

BOOST_FUSION_ADAPT_STRUCT(AST::Record, prefix, fqdn, a, b, c, d, primary, braced, tolerance)
BOOST_FUSION_ADAPT_STRUCT(AST::TimeSample, direction, value)

namespace qi = boost::spirit::qi;

template <typename It = std::string::const_iterator>
struct parser : qi::grammar<It, AST::Record()> {
    parser() : parser::base_type(start) {
        using namespace qi;

        start     = skip(blank) [record_];

        record_   = prefix_ >> fqdn_ >> int_ >> int_ >> int_ >> int_ >> sample_ >> '[' >> sample_ >> ']' >> tolerance_;

        prefix_   = string("^+"); // or whatever you need to match here
        fqdn_     = +graph; // or whatever additional constraints you have
        sample_   = direction_ >> duration_;
        duration_ = (long_ >> units_) [ _val = _1 * _2 ];
        tolerance_= "+/-" >> duration_;

        BOOST_SPIRIT_DEBUG_NODES(
                (start)(record_)
                (prefix_)(fqdn_)(sample_)(duration_)(tolerance_)
            )
    }
  private:
    struct directions : qi::symbols<char, AST::TimeSample::Direction> {
        directions() { add("+", AST::TimeSample::up)("-", AST::TimeSample::down); }
    } direction_;
    struct units : qi::symbols<char, AST::clock::duration> {
        units() {
            using namespace std::literals::chrono_literals;
            add("s", 1s)("ms", 1ms)("us", 1us)("µs", 1us)("ns", 1ns);
        }
    } units_;

    using Skipper = qi::blank_type;
    qi::rule<It, AST::Record()> start;
    qi::rule<It, AST::Record(), Skipper> record_;
    qi::rule<It, AST::TimeSample(), Skipper> sample_;
    qi::rule<It, AST::clock::duration(), Skipper> duration_, tolerance_;
    // lexemes:
    qi::rule<It, std::string()> prefix_;
    qi::rule<It, std::string()> fqdn_;
};

int main() {
    std::istringstream iss(R"(^+ line-17532.dyn.kponet.fi      2   7   377     1   +1503us[+9103us] +/-   55ms
)");

    std::string line;

    while (getline(iss, line)) {
        auto f = line.cbegin(), l = line.cend();
        AST::Record record;
        if (parse(f, l, parser<>{}, record))
            std::cout << "parsed: " << boost::fusion::as_vector(record) << "\n";
        else
            std::cout << "parse error\n";

        if (f!=l)
            std::cout << "remaining unparsed input: '" << std::string(f,l) << "'\n";
    }
}

Debug Output

<start>
  <try>^+ line-17532.dyn.kp</try>
  <record_>
    <try>^+ line-17532.dyn.kp</try>
    <prefix_>
      <try>^+ line-17532.dyn.kp</try>
      <success> line-17532.dyn.kpon</success>
      <attributes>[[^, +]]</attributes>
    </prefix_>
    <fqdn_>
      <try>line-17532.dyn.kpone</try>
      <success>      2   7   377   </success>
      <attributes>[[l, i, n, e, -, 1, 7, 5, 3, 2, ., d, y, n, ., k, p, o, n, e, t, ., f, i]]</attributes>
    </fqdn_>
    <sample_>
      <try>   +1503us[+9103us] </try>
      <duration_>
        <try>1503us[+9103us] +/- </try>
        <success>[+9103us] +/-   55ms</success>
        <attributes>[0.001503s]</attributes>
      </duration_>
      <success>[+9103us] +/-   55ms</success>
      <attributes>[[+, 0.001503s]]</attributes>
    </sample_>
    <sample_>
      <try>+9103us] +/-   55ms</try>
      <duration_>
        <try>9103us] +/-   55ms</try>
        <success>] +/-   55ms</success>
        <attributes>[0.009103s]</attributes>
      </duration_>
      <success>] +/-   55ms</success>
      <attributes>[[+, 0.009103s]]</attributes>
    </sample_>
    <tolerance_>
      <try> +/-   55ms</try>
      <duration_>
        <try>   55ms</try>
        <success></success>
        <attributes>[0.055s]</attributes>
      </duration_>
      <success></success>
      <attributes>[0.055s]</attributes>
    </tolerance_>
    <success></success>
    <attributes>[[[^, +], [l, i, n, e, -, 1, 7, 5, 3, 2, ., d, y, n, ., k, p, o, n, e, t, ., f, i], 2, 7, 377, 1, [+, 0.001503s], [+, 0.009103s], 0.055s]]</attributes>
  </record_>
  <success></success>
  <attributes>[[[^, +], [l, i, n, e, -, 1, 7, 5, 3, 2, ., d, y, n, ., k, p, o, n, e, t, ., f, i], 2, 7, 377, 1, [+, 0.001503s], [+, 0.009103s], 0.055s]]</attributes>
</start>

Upvotes: 5