Reputation: 11421
I have a collection of data and a collection of search filters I want to run against that data. The filters follow the LDAP search filter format and are parsed into an expression tree. The data is read one item at a time and processed through all the filters. Intermediate match results are stored in each leaf node of the tree until all the data has been processed. Then the final results are obtained by traversing the tree and applying the logical operators to each leaf node's intermediate result. For example, if I have the filter (&(a=b)(c=d))
then my tree will look like this:
root = "&"
left = "a=b"
right = "c=d"
So if a=b
and c=d
then both the left and right child nodes are a match and thus the filter is a match.
The data is a collection of different types of objects, each with their own fields. For example, assume the collection represents a class at a school:
class { name = "math" room = "12A" }
teacher { name = "John" age = "35" }
student { name = "Billy" age = "6" grade = "A" }
student { name = "Jane" age = "7" grade = "B" }
So a filter might look like (&(teacher.name=John)(student.age>6)(student.grade=A))
and be parsed like so:
root = "&"
left = "teacher.name=John"
right = "&"
left = "student.age>6"
right = "student.grade=A"
I run the class
object against it; no matches. I run the teacher
object against it; root.left
is a match. I run the first student
node against it; root.right.right
is a match. I run the second student
node against it; root.right.left
is a match. Then I traverse the tree and determine that all nodes matched and thus the final result is a match.
The problem is the intermediate matches need to be constrained based upon commonality: the student.age
and student.grade
filters need to somehow be tied together in order to store an intermediate match only if they match for the same object. I can't for the life of me figure out how to do this.
My filter node abstract base class:
class FilterNode
{
public:
virtual void Evaluate(string ObjectName, map<string, string> Attributes) = 0;
virtual bool IsMatch() = 0;
};
I have a LogicalFilterNode
class that handles logical AND, OR, and NOT operations; it's implementation is pretty straightforward:
void LogicalFilterNode::Evaluate(string ObjectName, map<string, string> Attributes)
{
m_Left->Evaluate(ObjectName, Attributes);
m_Right->Evaluate(ObjectName, Attributes);
}
bool LogicalFilterNode::IsMatch()
{
switch(m_Operator)
{
case AND:
return m_Left->IsMatch() && m_Right->IsMatch();
case OR:
return m_Left->IsMatch() || m_Right->IsMatch();
case NOT:
return !m_Left->IsMatch();
}
return false;
}
Then I have a ComparisonFilterNode
class that handles the leaf nodes:
void ComparisonFilterNode::Evaluate(string ObjectName, map<string, string> Attributes)
{
if(ObjectName == m_ObjectName) // e.g. "teacher", "student", etc.
{
foreach(string_pair Attribute in Attributes)
{
Evaluate(Attribute.Name, Attribute.Value);
}
}
}
void ComparisonFilterNode::Evaluate(string AttributeName, string AttributeValue)
{
if(AttributeName == m_AttributeName) // e.g. "age", "grade", etc.
{
if(Compare(AttributeValue, m_AttributeValue) // e.g. "6", "A", etc.
{
m_IsMatch = true;
}
}
}
bool ComparisonFilterNode::IsMatch() { return m_IsMatch; }
How it's used:
FilterNode* Root = Parse(...);
foreach(Object item in Data)
{
Root->Evaluate(item.Name, item.Attributes);
}
bool Match = Root->IsMatch();
Essentially what I need is for AND statements where the children have the same object name, the AND statement should only match if the children match for the same object.
Upvotes: 0
Views: 368
Reputation: 51226
Create a new unary "operator", let's call it thereExists
, which:
Specifically, for each instance of a thereExists
operator in an expression tree you should store a single bit indicating whether or not the subexpression below this tree node has been satisfied by any of the input records seen so far. These flags will initially be set to false
.
To continue processing your dataset efficiently (i.e. input record by input record, without having to load the entire dataset into memory), you should first preprocess the query expression tree to pull out a list of all instances of the thereExists
operator. Then as you read in each input record, test it against the child subexpression of each of these operators that still has its satisfied
flag set to false
. Any subexpression that is now satisfied should toggle its parent thereExists
node's satisfied
flag to true
-- and it would be a good idea to also attach a copy of the satisfying record to the newly-satisfied thereExists
node, if you want to actually see more than a "yes" or "no" answer to the overall query.
You only need to evaluate tree nodes above a thereExists
node once, after all input records have been processed as described above. Notice that anything referring to properties of an individual record must appear somewhere beneath a thereExists
node in the tree. Everything above a thereExists
node in the tree is only allowed to test "global" properties of the collection, or combine the results of thereExists
nodes using logical operators (AND, OR, XOR, NOT, etc.). Logical operators themselves can appear anywhere in the tree.
Using this, you can now evaluate expressions like
root = "&"
left = thereExists
child = "teacher.name=John"
right = "|"
left = thereExists
child = "&"
left = "student.age>6"
right = "student.grade=A"
right = thereExists
child = "student.name = Billy"
This will report "yes" if the collection of records contains both a teacher whose name is "John" and either a student named "Billy" or an A student aged over 6, or "no" otherwise. If you track satisfying records as I suggested, you'll also be able to dump these out in the case of a "yes" answer.
You could also add a second operator type, forAll
, which checks that its subexpression is true for every input record. But this is probably not as useful, and in any case you can simulate forAll(expr)
with not(thereExists(not(expr)))
.
Upvotes: 1