Gastyr

Reputation: 123

String treatment with blacklist and whitelist in Python

I am writing a program that needs to handle user input before passing it to eval(). The input will be a math function; I want to handle some irregularities and make sure the string really is a math function and not malicious code.

For this I wrote logic that compares every character of the string against a blacklist and a whitelist. The problem is that some characters may only appear in a specific arrangement: for example, the string may contain cos, but not c + o * s.

whitelist = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '(', ')',
            'x', 'y', 'sin', 'cos', 'tg', '+', '-', '*', '/', ' ']

blacklist = ['a', 'b', 'd', 'f', 'h', 'i', 'j', 'k', 'l', 'm', 'p', 'q',
            'r', 'u', 'v', 'w', 'z']

def stringTreat(string):
    if not any(ch in string for ch in blacklist):
        if all(ch in whitelist for ch in string):
            print('OK!!')
        else:
            print('stop at whitelist')
    else:
        print('stop at blacklist')

string = input('input:')

stringTreat(string)

If I enter 12 + 67 - 82 as the input in this example, the output is OK!!, but if I enter cos(x), the output is stop at whitelist.

How can I write logic that accepts certain substrings (e.g. sin, cos, tg) and characters (e.g. 0, 1, 2, 3...), and rejects all other substrings and characters (e.g. a, f, @, $, ls, mv)?

Upvotes: 2

Views: 2060

Answers (1)

scnerd

Reputation: 6103

What you're trying to build is generally called a parser, for which there are a number of established algorithms that you might find useful (consider looking into the ply package).

Generally, this is broken into two steps: a tokenizer and a grammar. The tokenizer breaks the input string into pieces and may tag them with a little extra information (e.g., 12 + cos(3) might become [NUM(12), OP(+), FUNC(cos), LPAREN, NUM(3), RPAREN]). Note that you can build a very simple tokenizer with a regular expression that splits on word boundaries (splitting on an empty match such as \b requires Python 3.7 or later):

In [1]: import re

In [2]: re.split(r'\b', '12 + 16 - cos(2)')
Out[2]: ['', '12', ' + ', '16', ' - ', 'cos', '(', '2', ')']

In [3]: [v.strip() for v in re.split(r'\b', '12 + 16 - cos(2)') if v.strip()]
Out[3]: ['12', '+', '16', '-', 'cos', '(', '2', ')']
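If you want the tokens tagged as well as split, something along these lines might work. This is only a sketch: the category names (NUM, NAME, OP, LPAREN, RPAREN) and the tokenize function are made up for illustration, and the patterns assume plain decimal numbers and ASCII identifiers.

```python
import re

# Hypothetical token categories; names and patterns are illustrative assumptions.
TOKEN_SPEC = [
    ("NUM", r"[0-9]+(?:\.[0-9]+)?"),   # integer or simple decimal
    ("NAME", r"[A-Za-z_][A-Za-z_0-9]*"),  # function or variable name
    ("OP", r"[+\-*/]"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SKIP", r"\s+"),                   # whitespace, dropped below
    ("MISMATCH", r"."),                 # any other single character
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (kind, value) pairs, or raise ValueError on a stray character."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        value = match.group()
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise ValueError(f"unexpected character {value!r}")
        tokens.append((kind, value))
    return tokens


print(tokenize("12 + cos(3)"))
```

Because the alternatives are tried in order, the catch-all MISMATCH pattern only fires for characters that no earlier category accepts, such as @ or $.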

The grammar then looks for patterns of tokens and decides what to do with them, usually forming them into some sort of "syntax tree" which is more easily operated on later. E.g., you might consider the entire function call to be a single unary expression, EXPR(cos, NUM(3)), and then the addition to be a binary expression that contains it, EXPR(add, NUM(12), EXPR(cos, NUM(3))). Note that this tree is now easy to evaluate: when you encounter an expression, look at the operator in the first position ('add', 'cos', etc.), and use that to figure out what to do with the remaining operands. These can be handled recursively, so the inner expression resolves to some number, which the outer expression can then use to resolve to a final, single number.
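To make the recursive evaluation concrete, here is a rough sketch that assumes the tree is already built and represented as nested tuples, e.g. ("add", 12, ("cos", 3)). The tuple layout, the OPERATIONS table and the evaluate function are all hypothetical choices for illustration, not something a particular parser library produces.

```python
import math
import operator

# Hypothetical operator table; the keys mirror the 'add' / 'cos' labels in the text.
OPERATIONS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "div": operator.truediv,
    "sin": math.sin,
    "cos": math.cos,
    "tg": math.tan,
}


def evaluate(node):
    """Evaluate a tree of nested tuples such as ("add", 12, ("cos", 3))."""
    if isinstance(node, (int, float)):
        return node  # a bare number is a leaf
    name, *operands = node
    if name not in OPERATIONS:
        raise ValueError(f"unknown operation {name!r}")
    # Resolve the inner expressions first, then apply the outer operator.
    values = [evaluate(operand) for operand in operands]
    return OPERATIONS[name](*values)


print(evaluate(("add", 12, ("cos", 3))))
```

Anything whose first element is not in the table is rejected, so the table doubles as the list of operations the user is allowed to use.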

You don't have to do things that way, but that background suggests an alternative to doing everything at once as you're trying to: run a tokenizer first. Then you just have some token STR(cos) or STR(ls), and you can easily recognize the former as valid input and throw an error if you encounter the latter (or anything else not on your whitelist).
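Applied to the names and symbols from the question, a token-level whitelist check could look roughly like this. It is a minimal sketch: ALLOWED_NAMES, ALLOWED_SYMBOLS and is_allowed are illustrative names, and the tokenizer is a deliberately crude regular expression.

```python
import re

# Whitelist taken from the question; the set names themselves are hypothetical.
ALLOWED_NAMES = {"sin", "cos", "tg", "x", "y"}
ALLOWED_SYMBOLS = set("+-*/()")

# A token is a run of ASCII digits, a run of ASCII letters,
# or any other single non-space character.
TOKEN_RE = re.compile(r"[0-9]+|[A-Za-z]+|\S")
NUMBER_RE = re.compile(r"[0-9]+")


def is_allowed(expression):
    """Return True if every token in the expression is on the whitelist."""
    for token in TOKEN_RE.findall(expression):
        if NUMBER_RE.fullmatch(token):
            continue
        if token in ALLOWED_NAMES or token in ALLOWED_SYMBOLS:
            continue
        return False
    return True


for text in ["12 + 67 - 82", "cos(x)", "c + o * s", "ls"]:
    print(text, "->", is_allowed(text))
```

Since letters are grouped into whole words before the comparison, cos passes while c, o and s on their own do not. This only checks the vocabulary, though; a string like cos cos would still pass, which is the kind of thing the grammar step is for.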

As a side note, you generally only have either a white list or a black list, not both. A white list usually assumes that anything else is invalid, and a black list assumes that anything else is valid, so having both introduces issues if something falls into both lists or neither list.

As a final note, since you're using Python, if you're careful and you're alright permitting general Python syntax, you can use eval and exec to do the parsing and execution for you. E.g.:

In [1]: import math

In [2]: eval('12 + 16 - cos(2)', {'cos': math.cos}, {})
Out[2]: 28.416146836547142

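You can specify in these dictionaries what functions you want the user to have access to. Be aware that this alone does not block everything else: unless the globals dictionary already contains a '__builtins__' key, eval adds Python's built-ins (including __import__) to it, and even with the built-ins removed eval is not a real sandbox. I still probably wouldn't do this unless you trust the user at least a little, or if they can only hurt themselves by screwing things up.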
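For what it's worth, a slightly more defensive version of the eval call might look like the following. Treat it as a sketch under assumptions: SAFE_NAMES and evaluate_expression are hypothetical names, and the explicit empty '__builtins__' entry is there because eval otherwise inserts the real built-ins into the globals dictionary.

```python
import math

# Functions the user may call; names follow the question (tg for tangent).
SAFE_NAMES = {"sin": math.sin, "cos": math.cos, "tg": math.tan}


def evaluate_expression(expression, x=0.0, y=0.0):
    """Evaluate a math expression with only the whitelisted names visible."""
    # An explicit empty '__builtins__' stops eval from adding the real ones,
    # so names like __import__ or open are not found.
    env = {"__builtins__": {}, **SAFE_NAMES, "x": x, "y": y}
    return eval(expression, env, {})


print(evaluate_expression("12 + 16 - cos(2)"))
print(evaluate_expression("x * 2 + y", x=3, y=1))
```

This narrows what an expression can reach, but it is not a security boundary on its own: attribute access on ordinary objects can still lead back to powerful types, so validating the tokens before calling eval is still a good idea.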

Upvotes: 2
