Construct Boolean masks based on unknown number of columns and values

Question

I would like to create logical masks based on one or more columns and one or more values in these columns in a pandas dataframe. These masks should then be applied to another column. In the simplest case, the mask might look like this:

mask = data['a'] == 4
newData = data['c'][mask]

However, more complex cases would also be possible:

mask = ((data['a'] == 4) | (data['a'] == 8)) & ((data['b'] == 1) | (data['b'] == 5))
newData = data['c'][mask]

In addition, multiple masks might be required. The main issue is that I don't know in advance how many masks will be required, how many columns will define each mask, or how many values in these columns will be involved, as this information would be provided by the user.

I thought that I could ask users to create an input file along these lines:

# <maskName> - <column>: <values> - <column>: <values> - etc.
maskA - a: 4, 8 - b: 1, 5 - c: 1
maskB - a: 0, 8 - c: 2, 6, 10

targetColumn: d

I could then read the input file and loop over it. By appropriately processing the lines, I could identify the number of required masks, the relevant columns, the relevant values, and the column to which the masks should be applied. I could also add this information to lists and/or dictionaries.

However, I'm not sure how best to deal with the issue that I don't know the number of masks/columns/values in advance and how to generate the appropriate masks once I know them. Any help would be greatly appreciated.
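The reading step described above could be sketched roughly as follows. This is only an illustration under a few assumptions: the function name `parse_mask_file` is made up, all values are integers, fields are separated by ` - ` (with spaces), and lines starting with `#` are comments. It collects each mask into a dictionary mapping column names to lists of allowed values.

```python
def parse_mask_file(lines):
    """Parse lines in the proposed format into (masks, target_column).

    Hypothetical helper: assumes integer values and ' - ' as field separator.
    """
    masks = {}
    target_column = None
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and comment lines
        if not line or line.startswith('#'):
            continue
        if line.startswith('targetColumn:'):
            target_column = line.split(':', 1)[1].strip()
            continue
        # First field is the mask name, the rest are 'column: v1, v2' fields
        name, *conds = line.split(' - ')
        spec = {}
        for cond in conds:
            column, vals = cond.split(':', 1)
            spec[column.strip()] = [int(v) for v in vals.split(',')]
        masks[name.strip()] = spec
    return masks, target_column


example = """\
# comment lines are skipped
maskA - a: 4, 8 - b: 1, 5 - c: 1
maskB - a: 0, 8 - c: 2, 6, 10

targetColumn: d
"""

masks, target = parse_mask_file(example.splitlines())
print(masks)
print(target)
```

Because the result is a plain dictionary per mask, the number of masks, columns, and values does not need to be known in advance; each can be looped over afterwards.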

Marius · Accepted Answer

Because df.query() accepts a string expression, finding the desired subset is easy as long as you can convert your input format into such a string. The parser I've written for your input format isn't super elegant, but hopefully you get the idea:

import pandas as pd
import numpy as np

maskA_str = "maskA - a: 4, 8 - b: 1, 5 - c: 1"
df = pd.DataFrame(
    {'a': np.random.randint(1, 10, 100),
     'b': np.random.randint(1, 10, 100),
     'c': np.random.randint(1, 10, 100)}
)

def create_query_str(mask_str):
    # Split on ' - ' (with spaces) so that negative values are not broken up.
    # The first field is the mask name; the rest are 'column: values' conditions.
    mask_name, *column_conds = mask_str.split(' - ')
    column_strs = []
    for cond in column_conds:
        column, vals = cond.split(':')
        column = column.strip()
        # OR together one equality test per allowed value in this column
        test_strs = ['{c} == {v}'.format(c=column, v=val.strip())
                     for val in vals.split(',')]
        column_strs.append('(' + ' | '.join(test_strs) + ')')
    # AND the per-column conditions together
    return '(' + ' & '.join(column_strs) + ')'

create_query_str(maskA_str)
Out[17]: '((a == 4 | a == 8) & (b == 1 | b == 5) & (c == 1))'

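# Can now be used directly in df.query(). With query's default parser,
# comparisons bind tighter than | and &, so 'a == 4 | a == 8' is read as
# (a == 4) | (a == 8).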
df.query(create_query_str(maskA_str))
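If you would rather build the Boolean masks themselves than a query string, one possible alternative (not part of the approach above) is to combine `Series.isin` tests with `functools.reduce`. The sketch below assumes the user input has already been parsed into dictionaries; the names `build_mask` and `specs`, the target column `d`, and the random test data are all hypothetical.

```python
from functools import reduce
import operator

import numpy as np
import pandas as pd


def build_mask(df, spec):
    """Return a Boolean Series: values within a column are OR-ed via isin(),
    and the per-column tests are AND-ed together."""
    column_masks = [df[col].isin(vals) for col, vals in spec.items()]
    # Start from an all-True mask so an empty spec selects every row
    all_true = pd.Series(True, index=df.index)
    return reduce(operator.and_, column_masks, all_true)


# Hypothetical, already-parsed user input: mask name -> {column: allowed values}
specs = {
    'maskA': {'a': [4, 8], 'b': [1, 5], 'c': [1]},
    'maskB': {'a': [0, 8], 'c': [2, 6, 10]},
}
target_column = 'd'

# Made-up example data with integer columns a, b, c, d
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 11, size=(200, 4)), columns=list('abcd'))

# Apply every mask to the target column; works for any number of masks
selected = {name: df.loc[build_mask(df, spec), target_column]
            for name, spec in specs.items()}
```

Each entry of `selected` is then the target column restricted to the rows matching that mask, which corresponds to `data['c'][mask]` in the question.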
