Reputation: 919

Filter objects using a user defined filter / runtime expression evaluation

I would like to implement a system that filters objects based on user-defined criteria (as described below) and honestly don't know where to start. If there are existing libraries, then great. If not, then a pointer in the right direction would be great as well.

I have a number of objects, Let's call them Cars which have properties, like make, model, etc.. I would like to be able to have the user provide a string for a filter, say "car.make == "honda" && car.year == "2012"" etc..

Then, over the course of my application's run time, I want to be able to run a check like this: if(filter(carobj) == true){ .... Note that what I am looking for is different from list comprehension in that I don't want to filter down a list, but want to see if an object meets a set of criteria.

I recognize that there are two likely components to this, one is parsing the user's input, the other is constructing such an object. I have a sense that there are some pretty decent expression tree parsers out there that may do the job for the former, but on the latter I am utterly lost.

The filter needs to be fast, because it will be operating on millions of objects and I cannot have boost dependencies either.

Upvotes: 0

Answers (3)

Tony Delroy

Reputation: 106196

Below - a taste of how to tackle the problem - very lightly tested and undoubtedly hiding a few bugs. As presented, it only handles std::string and int fields. It divides filtering into an expression-to-vector-of-tokens step, then repeatedly uses the tokens to test records passed to operator(). So - not highly optimised, but shouldn't be awfully sluggish either. There are some deliberate simplifications, e.g. you can compare "field == 'abc'" but not "'abc' == field". There's a lot more that could be done to validate the expression, provide more information about the location in the expression where parsing or evaluation fails etc.. I've left debug information in as if anyone picks this up they'll probably want it, both to understand how it works and to debug and evolve it.

#include <iostream>
#include <iomanip>
#include <string>
#include <vector>
#include <sstream>
#include <stdexcept>

#define DBG(MSG) do { std::cerr << ':' << __LINE__ << ' ' << MSG << '\n'; } while (false)

#define NEED(WHAT, THROW_MSG) \
    do { if (WHAT) break ; \
        std::ostringstream oss; \
        oss << THROW_MSG; \
        throw std::runtime_error(oss.str()); \
    } while (false)

struct Queryable
{
    virtual int get_field_id(const std::string& field) const = 0;
    virtual void load_field(int id, std::string&, int&) const = 0;
};

class Evaluator
{
  public:
    // lexs expression, optionally proactively verifying field identifiers against *pq
    Evaluator(const std::string& expression, const Queryable* pq = nullptr)
    {
        std::istringstream iss(expression);
        char c;
        int unmatched_paren = 0;
        while (iss >> c)
        {
            switch (c)
            {
              case '(': tokens_.emplace_back(LParen); ++unmatched_paren; break;
              case ')': tokens_.emplace_back(RParen); --unmatched_paren; break;
              case '-': case '0'...'9':
              {
                iss.unget();
                int i;
                iss >> i;
                tokens_.emplace_back(i);
                break;
              }
              case '\'':
                tokens_.emplace_back(StringLit);
                iss >> std::noskipws;
                while (iss >> c)
                    if (c == '\'') goto post_lit;
                    else tokens_.back().s_ += c;
                throw std::runtime_error("unterminated string literal");
                post_lit:
                iss >> std::skipws;
                break;
              case '&':
                NEED(iss.get() == '&', "missing second '&' that'd form AND operator");
                tokens_.emplace_back(And);
                break;
              case '|':
                NEED(iss.get() == '|', "missing second '&' that'd form AND operator");
                tokens_.emplace_back(Or);
                break;
              case '<':
                if (iss.peek() == '=') { iss.ignore(); tokens_.emplace_back(LE); }
                else tokens_.emplace_back(L);
                break;
              case '>':
                if (iss.peek() == '=') { iss.ignore(); tokens_.emplace_back(GE); }
                else tokens_.emplace_back(G);
                break;
              case '!':
                if (iss.peek() == '=') { iss.ignore(); tokens_.emplace_back(NE); }
                else tokens_.emplace_back(Not);
                break;
              case '=':
                if (iss.peek() == '=') iss.ignore(); // allow = and ==
                tokens_.emplace_back(E);
                break;
              default:
                NEED(std::isalpha(c), "can't parse content in expression at "
                    << iss.tellg() << " in '" << iss.str() << "', problem text '"
                    << iss.str().substr(iss.tellg(), 20) << "'...");
                tokens_.emplace_back(Idn);
                tokens_.back().s_ += c;
                iss >> std::noskipws;
                while (iss >> c)
                    if (!std::isalnum(c)) { iss.unget(); goto post_idn; }
                    else tokens_.back().s_ += c;
                post_idn:
                tokens_.back().i_ = pq ? pq->get_field_id(tokens_.back().s_) : 0;
                iss >> std::skipws;
            }
        }
        NEED(!unmatched_paren, "unbalanced paren in expression");
        DBG("tokens parsed: " << tokens_);
    }

    bool operator()(const Queryable& q) const
    {
        size_t token_pos = 0;
        return eval(q, token_pos);
    }

  private:
    bool eval(const Queryable& q, size_t& token_pos) const
    {
        bool so_far = true;
        bool hanging_not = false;
        std::string s;
        int i;
        for ( ; token_pos < tokens_.size(); ++token_pos)
        {
            const Token& t = tokens_[token_pos];
            switch (t.type_)
            {
              case Idn:
              {
                int id = t.i_ ? t.i_ : q.get_field_id(t.s_);
                q.load_field(id, s, i);
                DBG("loaded field " << id << ':' << t.s_ << ", s '" << s << "', i " << i);
                const Token& op = tokens_.at(++token_pos);
                const Token& rhs = tokens_.at(++token_pos);
                switch(op.type_)
                {
                  case L:  so_far = id > 0 ? s <  rhs.s_ : i <  rhs.i_; break;
                  case LE: so_far = id > 0 ? s <= rhs.s_ : i <= rhs.i_; break;
                  case E:  so_far = id > 0 ? s == rhs.s_ : i == rhs.i_; break;
                  case GE: so_far = id > 0 ? s >= rhs.s_ : i >= rhs.i_; break;
                  case G:  so_far = id > 0 ? s >  rhs.s_ : i >  rhs.i_; break;
                  case NE: so_far = id > 0 ? s != rhs.s_ : i != rhs.i_; break;
                  default:
                    NEED(false, "identifier followed by " << op
                         << " but only an operator is supported");
                }
                DBG("  " << op << ' ' << rhs << " -> " << so_far);
                break;
              }
              case And:
              case Or:
                if (so_far == (t.type_ == Or))  // false && ...   true || ...
                {
                    int depth = 0;
                    while (token_pos < tokens_.size() && depth >= 0)
                        if (tokens_[++token_pos].type_ == LParen) ++depth;
                        else if (tokens_[token_pos].type_ == RParen) --depth;
                    return so_far;
                }
                break;

              case Not: hanging_not = true; break;

              case LParen:
                so_far = hanging_not ^ eval(q, ++token_pos);
                hanging_not = false;
                DBG("post LParen so_far " << so_far << ", token_pos " << token_pos);
                break;

              case RParen: return so_far;

              default:
                throw std::runtime_error("unexpect token");
            }
        }
        return so_far;
    }

    enum Type { Idn, StringLit, IntLit, LParen, RParen, Not, And, Or, L, LE, E, GE, G, NE };
    struct Token
    {
        Type type_; std::string s_; int i_;
        Token(Type type) : type_(type) { }
        Token(int i) : type_(IntLit), i_(i) { }
        Token(Type type, const std::string& s) : type_(type), s_(s) { }
        Token(Type type, const std::string&& s) : type_(type), s_(s) { }
    };
    std::vector<Token> tokens_;

    friend std::ostream& operator<<(std::ostream& os, Type t)
    {
        switch (t)
        {
          case Idn: return os << "Idn";
          case StringLit: return os << "StringLit";
          case IntLit: return os << "IntLit";
          case LParen: return os << "LParen";
          case RParen: return os << "RParen";
          case Not: return os << "Not";
          case And: return os << "And";
          case Or: return os << "Or";
          case L: return os << 'L';
          case LE: return os << "LE";
          case E: return os << 'E';
          case GE: return os << "GE";
          case G: return os << 'G';
          case NE: return os << "NE";
          default: throw std::runtime_error("invalid Token type");
       }
    }

    friend std::ostream& operator<<(std::ostream& os, const Token& t)
    {
        os << t.type_;
        if (t.type_ == Idn || t.type_ == StringLit) return os << ":'" << t.s_ << '\'';
        if (t.type_ == IntLit) return os << ':' << t.i_;
        return os;
    }

    friend std::ostream& operator<<(std::ostream& os, const std::vector<Token>& v)
    {
        os << '{';
        size_t pos = 0;
        for (const auto& t : v) os << ' ' << pos++ << ':' << t;
        return os << " }";
    }
};

Sample usage:

struct Car : Queryable
{
    // negative field ids denote integral fields, positive strings, 0 is reserved
    enum Fields { Make = 1, Model, Year = -1};

    Car(const std::string& make, const std::string& model, int year)
      : make_(make), model_(model), year_(year)
    { }

    int get_field_id(const std::string& field) const override
    {
        if (field == "make") return (int)Make;
        if (field == "model") return (int)Model;
        if (field == "year") return (int)Year;
        throw std::runtime_error("attempt to lookup a field that doesn't exist");
    }

    void load_field(int id, std::string& s, int& i) const override
    {
        switch (id)
        {
          case Make: s = make_; break;
          case Model: s = model_; break;
          case Year: i = year_; break;
          default:
            throw std::runtime_error("attempt to retrieve a field using unknown field id");
        }
    }

    std::string make_, model_;
    int year_;
};

#define ASSERT_OP(X, OP, Y) \
    do { \
        const auto& x = (X); const auto& y = (Y); \
        if (x OP y) break; \
        std::cerr << "FAIL " << #X " " #OP " " #Y << " at :" << __LINE__ << '\n'; \
    } while (false)

#define ASSERT_EQ(X, Y) ASSERT_OP(X, ==, Y)
#define ASSERT(X) ASSERT_OP(X, ==, true)
#define ASSERT_NOT(X) ASSERT_OP(X, ==, false)

int main()
{
    Evaluator e("make == 'Honda' && (year == 1999 || year > 2005)");
    ASSERT(e(Car { "Honda", "Fit", 2008 }));
    ASSERT_NOT(e(Car { "Nissan", "GT-R", 2011 }));
    ASSERT(e(Car { "Honda", "NSX", 1999 }));

    // can also do field id lookups at Evaluator construction/lexing time for faster operator()...
    // (but then the Evaluator can't be used against other types with same field names but
    //  differing field ids)
    Car car { "Honda", "Civic", 2012 };
    Evaluator e2("make == 'Honda' && (year == 1999 || year > 2005)", &car);
    ASSERT(e2(car));
    ASSERT(e2(Car { "Honda", "Fit", 2008 }));
    ASSERT_NOT(e2(Car { "Nissan", "GT-R", 2011 }));
    ASSERT(e2(Car { "Honda", "NSX", 1999 }));
}

On coliru.stacked-crooked.com.

FWIW, any readers who're interested in this problem space but do have boost available might prefer using boost spirit and/or using boost::variants to handle different types.

Upvotes: 2

Sam Varshavchik

Reputation: 118445

1) Take a college course, or an equivalent, in compiler design or study up the theory of LALR(1) parsers on your own.

2) Define a formal grammar for your user-enterable filtering language, like the example you gave, where you wish for the user to be able to specify a search criteria simply by typing in car.make == "honda" && car.year == "2012". Try to come up with the lexical and the grammatical structure of your filtering language.

3) Implement your lexical analyzer, and parser, using existing common tool, like lex; and yacc or it's modern cousin, GNU bison, in order to implement the framework for entering a filtering string of the kind you with to implement, and produce some kind of a parsing structure, of the entered filtering string, that you can now execute and apply to your existing list of objects, in order to execute your filter.

That's about it, sounds like a fun project.

Of course, it is not strictly necessary to use the stock lex/yacc framework for implementing something like this. It's certainly possible to implement your own, hand-coded lexical analyzer and grammar parser; that was my approach to coding maildrop, which implements its own, internal, lexical analyzer and recursive-descent parser.

This is not something where one can push a button, or two, and have a canned parser pop out of thin air, to implement something like this. This is a fairly complicated, sophisticated area of computer science, and compiler design.

Upvotes: 1

Brad

Reputation: 2320

One option may be CLucene. If you can't pull in Boost though, that may be too big as well. In that case, you'll have to bite the bullet and write the parser and Abstract Syntax Tree. Grab a Programming Language Concepts book.

Upvotes: 0

Filter objects using a user defined filter / runtime expression evaluation

Answers (3)

Related Questions