Reputation: 595
I am trying to develop the following filter with a pandas dataframe:
A
, B
, A_prime
and B_prime
;C
, then I would like to find the sum between A_prime
and B_prime
and allocate it to the maximum of A_prime
and B_prime
, while setting the minimum of A_prime
and B_prime
to zero.How would I write this as a pandas aggregation function?
A working example, written inefficiently, is shown below:
import pandas as pd
import numpy as np
data = {
"A":list(np.abs(np.random.randn(10))),
"B":list(np.abs(np.random.randn(10))),
"A_prime":list(np.abs(np.random.randn(10))),
"B_prime":list(np.abs(np.random.randn(10)))
}
df = pd.DataFrame.from_dict(data)
C = 0.2
print("BEFORE:")
print(df)
for index, row in df.iterrows():
if(row["A"] < C or row["B"] < C):
max_idx = np.argmax([row["A"], row["B"]])
if(max_idx==0):
row["A_prime"] = row["A_prime"] + row["B_prime"]
row["B_prime"] = 0
else:
row["B_prime"] = row["A_prime"] + row["B_prime"]
row["A_prime"] = 0
print("")
print("AFTER:")
print(df)
Output:
BEFORE:
A B A_prime B_prime
0 0.182445 0.924890 1.563398 0.562325
1 0.252587 0.273637 0.515395 0.538876
2 1.369412 1.985702 1.813962 1.643794
3 0.834666 0.143880 0.860673 0.372468
4 1.380012 0.715774 0.022681 0.892717
5 0.582497 0.477100 0.956821 1.134613
6 0.083045 0.322060 0.362513 1.386124
7 1.384267 0.251577 0.639843 0.458650
8 0.375456 0.412320 0.661661 0.086588
9 0.079226 0.385621 0.601451 0.837827
AFTER:
A B A_prime B_prime
0 0.182445 0.924890 0.000000 2.125723
1 0.252587 0.273637 0.515395 0.538876
2 1.369412 1.985702 1.813962 1.643794
3 0.834666 0.143880 1.233141 0.000000
4 1.380012 0.715774 0.022681 0.892717
5 0.582497 0.477100 0.956821 1.134613
6 0.083045 0.322060 0.000000 1.748638
7 1.384267 0.251577 0.639843 0.458650
8 0.375456 0.412320 0.661661 0.086588
9 0.079226 0.385621 0.000000 1.439278
Upvotes: 4
Views: 254
Reputation: 445
I don't know if this can be done with aggregation. I am sharing something that has worked for me. See if it works for you too.
Firstly, we can generate the masks
#Get mask of all the rows where A_Prime > B_Prime and either of A or B less than C(0.2)
mask_val_a = ((dataset['A'] < C) | (dataset['B'] < C)) & \
(dataset['A_Prime'] > dataset['B_Prime'])
#Get mask of all the rows where A_Prime < B_Prime and either of A or B less than C(0.2)
mask_val_b = ((dataset['A'] < C) | (dataset['B'] < C)) & (dataset['A_Prime'] < dataset['B_Prime'])
Once we have masks values, we can just use pandas mask to update the values
#Firstly we will update the values where A_Prime > B_Prime
dataset['A_Prime'] = dataset['A_Prime'].mask(mask_val_a, dataset[['A_Prime', 'B_Prime']].sum(axis = 1))
dataset['B_Prime'] = dataset['B_Prime'].mask(mask_val_a, 0)
#Then we will update the values where B_Prime > A_Prime
dataset['B_Prime'] = dataset['B_Prime'].mask(mask_val_b, dataset[['A_Prime', 'B_Prime']].sum(axis = 1))
dataset['A_Prime'] = dataset['A_Prime'].mask(mask_val_b, 0)
Upvotes: 0
Reputation: 2421
You can rewrite values based on a conditional map and apply
.
# Columns to check the min/max on:
check_max_cols = ["A_prime", "B_prime"]
def allocate_sum(row):
# Identify max and min values
max_col = row[check_max_cols].idxmax(axis=1)
min_col = check_max_cols[1] if max_col == check_max_cols[0] else check_max_cols[0]
row[max_col] = row[["A_prime", "B_prime"]].sum()
row[min_col] = 0
return row
below_threshold = (df[["A", "B"]] < C).any(axis=1)
df.loc[below_threshold, :] = df.loc[below_threshold, :].apply(allocate_sum, axis=1)
Upvotes: 1
Reputation: 631
The most pandified way of doing this would be with an apply()
, as the example below:
import pandas as pd
import numpy as np
data = {
"A":list(np.abs(np.random.randn(10))),
"B":list(np.abs(np.random.randn(10))),
"A_prime":list(np.abs(np.random.randn(10))),
"B_prime":list(np.abs(np.random.randn(10)))
}
df = pd.DataFrame.from_dict(data)
C = 0.2
def A_B_prime(row):
A_prime_val = row["A_prime"]
B_prime_val = row["B_prime"]
if(row["A"] < C or row["B"] < C):
max_idx = np.argmax([row["A"], row["B"]])
if(max_idx==0):
A_prime_val = row["A_prime"] + row["B_prime"]
B_prime_val = 0
else:
B_prime_val = row["A_prime"] + row["B_prime"]
A_prime_val = 0
return A_prime_val, B_prime_val
df['A_prime'], df['B_prime'] = zip(*df.apply(A_B_prime, axis=1))
You can find some good insights on how to return multiple columns from a single apply()
on this thread.
Upvotes: 1
Reputation: 18306
here is one way:
prime_cols = ["A_prime", "B_prime"]
# get the candidate sums
prime_sums = df[prime_cols].sum(axis=1)
# check which rows satisfy the `C` threshold
threshold_satisfied = df.A.lt(C) | df.B.lt(C)
# set the satisfying rows' values to sums for both columns
df.loc[threshold_satisfied, prime_cols] = prime_sums
# generate a 1-0 mask that will multiply the greater value by 1 and
# smaller value by 0 to "select" one of them and kill other
mask_A_side = df.A.gt(df.B)
the_mask = pd.concat([mask_A_side, ~mask_A_side], axis=1).set_axis(prime_cols, axis=1)
# multiply with the mask
df.loc[threshold_satisfied, prime_cols] *= the_mask
which first puts the sum of prime columns to both of the columns where threshold condition is satisfied, and then kills one of them with a 1-0 mask multiplication.
to get
>>> df
A B A_prime B_prime
0 0.182445 0.924890 0.000000 2.125723
1 0.252587 0.273637 0.515395 0.538876
2 1.369412 1.985702 1.813962 1.643794
3 0.834666 0.143880 1.233141 0.000000
4 1.380012 0.715774 0.022681 0.892717
5 0.582497 0.477100 0.956821 1.134613
6 0.083045 0.322060 0.000000 1.748637
7 1.384267 0.251577 0.639843 0.458650
8 0.375456 0.412320 0.661661 0.086588
9 0.079226 0.385621 0.000000 1.439278
Upvotes: 2