Reputation: 4470
I have a sample DF:
sample_df = pd.DataFrame(np.random.randint(1,20,size=(6, 2)), columns=list('AB'))
sample_df["A_cat"] = ["ind","sa","sa","sa","ind","ind"]
sample_df["B_cat"] = ["sa","ind","ind","sa","sa","sa"]
sample_df
OP:
A B A_cat B_cat
0 12 8 ind sa
1 12 11 sa ind
2 7 19 sa ind
3 5 11 sa sa
4 11 7 ind sa
5 6 18 ind sa
I have another sample DF 2 for which I am trying to replace the column values based on a condition:
sample_df2 = pd.DataFrame()
sample_df2["A_cat"] = ["sa","ind","ind","sa","sa","ind"]
sample_df2["B_cat"] = ["ind","sa","sa","ind","sa","sa"]
sample_df2
OP:
A_cat B_cat
0 sa ind
1 ind sa
2 ind sa
3 sa ind
4 sa sa
5 ind sa
Condition:
The value in sample_df2 should be replaced by taking a groupby mean of that value in sample_df.
For example, sample_df2(0,A_cat) = sa
which should be replaced by sample_df.groupby(["A_cat"])["A"].mean() for group value sa
sample OP of column A_cat
in sample_df2
after conversion will be:
sample_df2["A_cat"] = [8.0000,9.666667,9.666667,8.000,8.000,9.666667]
I have done the long for loop solution for this, any suggestions for pandas approach would be great!
Upvotes: 2
Views: 72
Reputation: 2895
This is a bit clunky, but it works:
The actual calculations are done once:
category_value_means = sample_df2.apply(
lambda column: sample_df.groupby(column.name)[column.name.split('_')[0]].mean())
whose output is:
A_cat B_cat
ind 9.666667 15
sa 8.000000 11
Afterwards, all is needed is some "lookup":
sample_df2.apply(lambda column: column.apply(lambda value: category_value_means[column.name][value]))
the output is:
A_cat B_cat
0 8.000000 15
1 9.666667 11
2 9.666667 11
3 8.000000 15
4 8.000000 11
5 9.666667 11
Upvotes: 0
Reputation: 467
Please check the following code and compare the elapsed time
import time
start_time = time.time()
categorical_cols = [col for col in df1.columns if len(col.split('_')) > 1]
numerical_cols = df1[df1.columns.difference(categorical_cols)].columns
unique_groups = df1['A_cat'].unique().tolist() # returns [ind, sa]
for cat, num in zip(categorical_cols, numerical_cols):
for group in unique_groups:
df2.loc[df2[cat] == group, cat] = df1.groupby(cat)[num].mean().loc[group]
print("Elapsed time:", time.time() - start)
Upvotes: 1