Reputation: 2171
I have a task where I have to filter a Pandas DataFrame based on a user-specified logical expression. Now, I've seen modules like pyparsing and Lark which I would like to use, but I cannot seem to figure out how to set them up.
I have several operators like CONTAINS, EQUAL, FUZZY_MATCH etc. Also, I'd like to combine some expressions into more complex ones.
Example expression:
ColumnA CONTAINS [1, 2, 3] AND (ColumnB FUZZY_MATCH 'bla' OR ColumnC EQUAL 45)
As a result, I'd like to get some structured dict or list that nests the operations in the order they should be executed. So, the desired result for this example expression would be something like:
[['ColumnA', 'CONTAINS', '[1, 2, 3]'], 'AND', [['ColumnB', 'FUZZY_MATCH', 'bla'], 'OR', ['ColumnC', 'EQUAL', '45']]]
or in form of dict:
{
    'EXPR1': {
        'col': 'ColumnA',
        'oper': 'CONTAINS',
        'value': '[1, 2, 3]'
    },
    'OPERATOR': 'AND',
    'EXPR2': {
        'EXPR21': {
            'col': 'ColumnB',
            'oper': 'FUZZY_MATCH',
            'value': 'bla'
        },
        'OPERATOR': 'OR',
        'EXPR22': {
            'col': 'ColumnC',
            'oper': 'EQUAL',
            'value': '45'
        }
    }
}
Or something like that. If you have a better way of structuring the result, I'm open to suggestions. I'm pretty new to this, so I'm fairly certain this can be improved.
Upvotes: 3
Views: 1649
Reputation: 8576
Interesting problem :)
Seems like a relatively straightforward application of the shunting yard algorithm.
I had written code to parse expressions like "((20 - 10 ) * (30 - 20) / 10 + 10 ) * 2"
over here.
import re

def tokenize(expression):
    return re.findall(r"[+/*()-]|\d+", expression)

def is_number(token):
    try:
        int(token)
        return True
    except ValueError:
        return False

def peek(stack):
    return stack[-1] if stack else None

def apply_operator(operators, values):
    operator = operators.pop()
    right = values.pop()
    left = values.pop()
    values.append(eval("{0}{1}{2}".format(left, operator, right)))

def greater_precedence(op1, op2):
    precedences = {"+": 0, "-": 0, "*": 1, "/": 1}
    return precedences[op1] > precedences[op2]

def evaluate(expression):
    tokens = tokenize(expression)
    values = []
    operators = []
    for token in tokens:
        if is_number(token):
            values.append(int(token))
        elif token == "(":
            operators.append(token)
        elif token == ")":
            top = peek(operators)
            while top is not None and top != "(":
                apply_operator(operators, values)
                top = peek(operators)
            operators.pop()  # Discard the '('
        else:
            # Operator
            top = peek(operators)
            while top is not None and top != "(" and greater_precedence(top, token):
                apply_operator(operators, values)
                top = peek(operators)
            operators.append(token)
    while peek(operators) is not None:
        apply_operator(operators, values)
    return values[0]

def main():
    expression = "((20 - 10 ) * (30 - 20) / 10 + 10 ) * 2"
    print(evaluate(expression))

if __name__ == "__main__":
    main()
I reckon we can modify the code slightly to make it work for your case:
1. Modify tokenize() so that for the input ColumnA CONTAINS [1, 2, 3] AND (ColumnB FUZZY_MATCH 'bla' OR ColumnC EQUAL 45), we get the list of tokens: ['ColumnA', 'CONTAINS', '[1, 2, 3]', 'AND', '(', 'ColumnB', 'FUZZY_MATCH', "'bla'", 'OR', 'ColumnC', 'EQUAL', '45', ')'].
2. Modify the is_number() function to instead detect operands like ColumnA, [1, 2, 3] etc., as opposed to the operators CONTAINS/FUZZY_MATCH/EQUAL, AND/OR, and the parentheses ( / ).
3. Modify greater_precedence(op1, op2) to return true in case op1 is among ['CONTAINS', 'EQUAL', ..] and op2 is among ['AND', 'OR']. We want contains and equals to always be evaluated before AND/OR.
4. Modify apply_operator(operators, values) to implement the logic of how to evaluate the boolean expression ColumnA CONTAINS [1, 2, 3] or the expression true AND false.

CONTAINS, FUZZY_MATCH, EQUAL, AND, OR etc. are all operators here.

Upvotes: 3
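Putting those modifications together, a minimal sketch might look like this. The regex, the COMPARISONS/LOGICALS sets, and the choice to build a nested list instead of evaluating immediately are my own illustrative assumptions, not part of the code above:

```python
import re

# Assumed operator sets: comparisons bind tighter than the logical connectives.
COMPARISONS = {"CONTAINS", "FUZZY_MATCH", "EQUAL"}
LOGICALS = {"AND", "OR"}

def tokenize(expression):
    # Match bracketed lists, quoted strings, parentheses, then bare words/numbers.
    return re.findall(r"\[[^\]]*\]|'[^']*'|[()]|\w+", expression)

def greater_precedence(op1, op2):
    # True when a comparison is waiting on the stack and a logical operator arrives.
    return op1 in COMPARISONS and op2 in LOGICALS

def apply_operator(operators, values):
    operator = operators.pop()
    right = values.pop()
    left = values.pop()
    values.append([left, operator, right])  # build a nested list instead of evaluating

def parse(expression):
    tokens = tokenize(expression)
    values, operators = [], []
    for token in tokens:
        if token == "(":
            operators.append(token)
        elif token == ")":
            while operators and operators[-1] != "(":
                apply_operator(operators, values)
            operators.pop()  # discard the '('
        elif token in COMPARISONS or token in LOGICALS:
            while (operators and operators[-1] != "("
                   and greater_precedence(operators[-1], token)):
                apply_operator(operators, values)
            operators.append(token)
        else:
            values.append(token)  # operand: column name, list literal, string, number
    while operators:
        apply_operator(operators, values)
    return values[0]

expr = "ColumnA CONTAINS [1, 2, 3] AND (ColumnB FUZZY_MATCH 'bla' OR ColumnC EQUAL 45)"
print(parse(expr))
# [['ColumnA', 'CONTAINS', '[1, 2, 3]'], 'AND',
#  [['ColumnB', 'FUZZY_MATCH', "'bla'"], 'OR', ['ColumnC', 'EQUAL', '45']]]
```

This returns exactly the nested-list shape asked for in the question; a separate recursive walk over that structure can then translate each leaf triple into a boolean mask on the DataFrame.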