Reputation: 611
I have a data frame df1:
df1 = pd.DataFrame([[40, 23, 22, 31], [12, 3, 11,23], [42, 16, 32, 34], [42, 13, 26, 34]], columns=['A', 'B', 'C', 'D'])
I have another data frame df2:
df2 = pd.DataFrame([["B","<20"],["A",">30"],["C","<40"],["D","<15"]], columns=["Column","Condition"])
Question: Select the data frame df1 based on any of the conditions present in df2. How to do it? Please help.
Expected Output example: For B Condition:
B_df = pd.DataFrame([3,16,13],columns=["B"])
For C Condition:
C_df = pd.DataFrame([22,11,32,26],columns=["C"])
Upvotes: 1
Views: 212
Reputation: 958
You can use any boolean Series as a condition (provided its index match df1
's index), e.g.:
cond = df1["A"]>30
print(pd.DataFrame(df1["A"][cond]))
For convenience, let consider a slightly different form of df2
:
df2 = pd.DataFrame([["B","<",20],["A",">",30],["C","<",40],["D","<",15]], columns=["Column","Operator","Value"])
and a dictionary that stores operators:
import operator
operator_dict={
">": operator.gt,
"<": operator.lt
}
Then the following will do the job:
for index, row in df2.iterrows():
cond = operator_dict[row["Operator"]](df1[row["Column"]],row["Value"])
print(pd.DataFrame(df1[row["Column"]][cond]))
Upvotes: 1
Reputation: 31146
Build a query string and you can flexibly use one or more of the filters.
df1 = pd.DataFrame([[40, 23, 22, 31], [12, 3, 11,23], [42, 16, 32, 34], [42, 13, 26, 34]], columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame([["B","<20"],["A",">30"],["C","<40"],["D","<15"]], columns=["Column","Condition"])
def myfilter(col):
if isinstance(col, str): col = [col]
mask = df2.Column.isin(col)
return df1.query(" and ".join([f"{c[0]}{c[1]}" for c in df2.loc[mask].values])).loc[:,col]
print(myfilter("B"))
print(myfilter("C"))
print(myfilter(["B","C"]))
output
B
1 3
2 16
3 13
C
0 22
1 11
2 32
3 26
B C
1 3 11
2 16 32
3 13 26
Upvotes: 1
Reputation: 14216
Another approach is to make functions to make this process repeatable.
OPS = {"<": "lt", ">": "gt"}
def get_conditions(df, idx):
column = df.loc[idx].Column
condition = df.loc[idx].Condition
attr = condition[0]
threshold = int(condition[1:])
return (column, attr, threshold)
def get_matching(conditions, df):
col, attr, thresh = conditions
fn = getattr(pd.Series, OPS[attr])
mask = fn(df[col], thresh)
return df.loc[mask, col]
for num in range(len(df2)):
conds = get_conditions(df2, num)
print(conds[0])
print("--------")
print(get_matching(conds, df1), end='\n\n')
B
--------
1 3
2 16
3 13
Name: B, dtype: int64
A
--------
0 40
2 42
3 42
Name: A, dtype: int64
C
--------
0 22
1 11
2 32
3 26
Name: C, dtype: int64
D
--------
Series([], Name: D, dtype: int64)
Upvotes: 1
Reputation: 195418
You can use DataFrame.query
for the task:
for c, cnd in zip(df2['Column'], df2['Condition']):
print(pd.DataFrame(df1[c], columns=[c]).query('{}{}'.format(c, cnd)))
print('-' * 80)
Prints four dataframes (based on conditions in df2
):
B
1 3
2 16
3 13
--------------------------------------------------------------------------------
A
0 40
2 42
3 42
--------------------------------------------------------------------------------
C
0 22
1 11
2 32
3 26
--------------------------------------------------------------------------------
Empty DataFrame
Columns: [D]
Index: []
--------------------------------------------------------------------------------
Upvotes: 3