Luke
Luke

Reputation: 11421

How to evaluate a complex expression tree against incremental data?

I have a collection of data and a collection of search filters I want to run against that data. The filters follow the LDAP search filter format and are parsed into an expression tree. The data is read one item at a time and processed through all the filters. Intermediate match results are stored in each leaf node of the tree until all the data has been processed. Then the final results are obtained by traversing the tree and applying the logical operators to each leaf node's intermediate result. For example, if I have the filter (&(a=b)(c=d)) then my tree will look like this:

root = "&"
    left = "a=b"
    right = "c=d"

So if a=b and c=d then both the left and right child nodes are a match and thus the filter is a match.

The data is a collection of different types of objects, each with their own fields. For example, assume the collection represents a class at a school:

class { name = "math" room = "12A" }
teacher { name = "John" age = "35" }
student { name = "Billy" age = "6" grade = "A" }
student { name = "Jane" age = "7" grade = "B" }

So a filter might look like (&(teacher.name=John)(student.age>6)(student.grade=A)) and be parsed like so:

root = "&"
    left = "teacher.name=John"
    right = "&"
        left = "student.age>6"
        right = "student.grade=A"

I run the class object against it; no matches. I run the teacher object against it; root.left is a match. I run the first student node against it; root.right.right is a match. I run the second student node against it; root.right.left is a match. Then I traverse the tree and determine that all nodes matched and thus the final result is a match.

The problem is the intermediate matches need to be constrained based upon commonality: the student.age and student.grade filters need to somehow be tied together in order to store an intermediate match only if they match for the same object. I can't for the life of me figure out how to do this.

My filter node abstract base class:

class FilterNode
{
public:
    virtual void Evaluate(string ObjectName, map<string, string> Attributes) = 0;
    virtual bool IsMatch() = 0;
};

I have a LogicalFilterNode class that handles logical AND, OR, and NOT operations; it's implementation is pretty straightforward:

void LogicalFilterNode::Evaluate(string ObjectName, map<string, string> Attributes)
{
    m_Left->Evaluate(ObjectName, Attributes);
    m_Right->Evaluate(ObjectName, Attributes);
}

bool LogicalFilterNode::IsMatch()
{
    switch(m_Operator)
    {
    case AND:
        return m_Left->IsMatch() && m_Right->IsMatch();
    case OR:
        return m_Left->IsMatch() || m_Right->IsMatch();
    case NOT:
        return !m_Left->IsMatch();
    }
    return false;
}

Then I have a ComparisonFilterNode class that handles the leaf nodes:

void ComparisonFilterNode::Evaluate(string ObjectName, map<string, string> Attributes)
{
    if(ObjectName == m_ObjectName) // e.g. "teacher", "student", etc.
    {
        foreach(string_pair Attribute in Attributes)
        {
            Evaluate(Attribute.Name, Attribute.Value);
        }
    }
}

void ComparisonFilterNode::Evaluate(string AttributeName, string AttributeValue)
{
    if(AttributeName == m_AttributeName) // e.g. "age", "grade", etc.
    {
        if(Compare(AttributeValue, m_AttributeValue) // e.g. "6", "A", etc.
        {
            m_IsMatch = true;
        }
    }
}

bool ComparisonFilterNode::IsMatch() { return m_IsMatch; }

How it's used:

FilterNode* Root = Parse(...);
foreach(Object item in Data)
{
    Root->Evaluate(item.Name, item.Attributes);
}
bool Match = Root->IsMatch();

Essentially what I need is for AND statements where the children have the same object name, the AND statement should only match if the children match for the same object.

Upvotes: 0

Views: 368

Answers (1)

j_random_hacker
j_random_hacker

Reputation: 51226

Create a new unary "operator", let's call it thereExists, which:

  1. Does have state, and
  2. Declares that its child subexpression must be satisfied by a single input record.

Specifically, for each instance of a thereExists operator in an expression tree you should store a single bit indicating whether or not the subexpression below this tree node has been satisfied by any of the input records seen so far. These flags will initially be set to false.

To continue processing your dataset efficiently (i.e. input record by input record, without having to load the entire dataset into memory), you should first preprocess the query expression tree to pull out a list of all instances of the thereExists operator. Then as you read in each input record, test it against the child subexpression of each of these operators that still has its satisfied flag set to false. Any subexpression that is now satisfied should toggle its parent thereExists node's satisfied flag to true -- and it would be a good idea to also attach a copy of the satisfying record to the newly-satisfied thereExists node, if you want to actually see more than a "yes" or "no" answer to the overall query.

You only need to evaluate tree nodes above a thereExists node once, after all input records have been processed as described above. Notice that anything referring to properties of an individual record must appear somewhere beneath a thereExists node in the tree. Everything above a thereExists node in the tree is only allowed to test "global" properties of the collection, or combine the results of thereExists nodes using logical operators (AND, OR, XOR, NOT, etc.). Logical operators themselves can appear anywhere in the tree.

Using this, you can now evaluate expressions like

root = "&"
    left = thereExists
        child = "teacher.name=John"
    right = "|"
        left = thereExists
            child = "&"
                left = "student.age>6"
                right = "student.grade=A"
        right = thereExists
            child = "student.name = Billy"

This will report "yes" if the collection of records contains both a teacher whose name is "John" and either a student named "Billy" or an A student aged over 6, or "no" otherwise. If you track satisfying records as I suggested, you'll also be able to dump these out in the case of a "yes" answer.

You could also add a second operator type, forAll, which checks that its subexpression is true for every input record. But this is probably not as useful, and in any case you can simulate forAll(expr) with not(thereExists(not(expr))).

Upvotes: 1

Related Questions