Reputation: 1227
I would like to create logical masks based on one or more columns and one or more values in these columns in a pandas dataframe. These masks should then be applied to another column. In the simplest case, the mask might look like this:
mask = data['a'] == 4
newData = data['c'][mask]
However, more complex cases would also be possible:
mask = ((data['a'] == 4) | (data['a'] == 8)) & ((data['b'] == 1) | (data['b'] == 5))
newData = data['c'][mask]
In addition, multiple masks might be required. The main issue is that I don't know in advance
as this information would be provided by the user.
I thought that I could ask users to create an input file along these lines:
# <maskName> - <columnName>: <columnValue(s)> - <columnName>: <columnValue(s)> - etc.
maskA - a: 4, 8 - b: 1, 5 - c: 1
maskB - a: 0, 8 - c: 2, 6, 10
targetColumn: d
I could then read the input file and loop over it. By appropriately processing the lines, I could identify the number of required masks, the relevant columns, the relevant values, and the column to which the masks should be applied. I could also add this information to lists and/or dictionaries.
However, I'm not sure how best to deal with the issue that I don't know the number of masks/columns/values in advance and how to generate the appropriate masks once I know them. Any help would be greatly appreciated.
Upvotes: 0
Views: 314
Reputation: 60230
Because you can pass strings to df.query()
, finding the desired subset is really easy as long as you can convert your input format to a string. The parser I've written for your input format isn't super elegant but hopefully you get the idea:
import pandas as pd
import numpy as np
maskA_str = "maskA - a: 4, 8 - b: 1, 5 - c: 1"
df = pd.DataFrame(
{'a': np.random.randint(1, 10, 100),
'b': np.random.randint(1, 10, 100),
'c': np.random.randint(1, 10, 100)}
)
def create_query_str(mask_str):
mask_name, column_conds = mask_str.split('-')[0], mask_str.split('-')[1:]
query_str = '('
column_strs =[]
for cond in column_conds:
cond_str = '('
column, vals = cond.split(':')
column = column.strip()
test_strs = ['{c} == {v}'.format(c=column, v=val.strip())
for val in vals.split(',')]
cond_str += ' | '.join(test_strs)
cond_str += ')'
column_strs.append(cond_str)
query_str += ' & '.join(column_strs)
query_str += ')'
return query_str
create_query_str(maskA_str)
Out[17]: '((a == 4 | a == 8) & (b == 1 | b == 5) & (c == 1))'
# Can now be used directly in df.query()
df.query(create_query_str(maskA_str))
Upvotes: 1