data_person
data_person

Reputation: 4470

Replace pandas column values based on another DF

I have a sample DF:

sample_df = pd.DataFrame(np.random.randint(1,20,size=(6, 2)), columns=list('AB'))
sample_df["A_cat"] = ["ind","sa","sa","sa","ind","ind"]
sample_df["B_cat"] = ["sa","ind","ind","sa","sa","sa"]
sample_df

OP:

    A   B   A_cat   B_cat
0   12  8   ind     sa
1   12  11  sa      ind
2   7   19  sa      ind
3   5   11  sa      sa
4   11  7   ind     sa
5   6   18  ind     sa

I have another sample DF 2 for which I am trying to replace the column values based on a condition:

sample_df2 = pd.DataFrame()
sample_df2["A_cat"] = ["sa","ind","ind","sa","sa","ind"]
sample_df2["B_cat"] = ["ind","sa","sa","ind","sa","sa"]
sample_df2

OP:

    A_cat   B_cat
0   sa     ind
1   ind    sa
2   ind    sa
3   sa     ind
4   sa     sa
5   ind    sa

Condition:

The value in sample_df2 should be replaced by taking a groupby mean of that value in sample_df.

For example, sample_df2(0,A_cat) = sa which should be replaced by sample_df.groupby(["A_cat"])["A"].mean() for group value sa

sample OP of column A_cat in sample_df2 after conversion will be:

sample_df2["A_cat"] = [8.0000,9.666667,9.666667,8.000,8.000,9.666667]

I have done the long for loop solution for this, any suggestions for pandas approach would be great!

Upvotes: 2

Views: 72

Answers (2)

Itamar Mushkin
Itamar Mushkin

Reputation: 2895

This is a bit clunky, but it works:

The actual calculations are done once:

category_value_means = sample_df2.apply(
    lambda column: sample_df.groupby(column.name)[column.name.split('_')[0]].mean())

whose output is:

    A_cat       B_cat
ind 9.666667    15
sa  8.000000    11

Afterwards, all is needed is some "lookup":

sample_df2.apply(lambda column: column.apply(lambda value: category_value_means[column.name][value]))

the output is:

    A_cat   B_cat
0   8.000000    15
1   9.666667    11
2   9.666667    11
3   8.000000    15
4   8.000000    11
5   9.666667    11

Upvotes: 0

isabella
isabella

Reputation: 467

Please check the following code and compare the elapsed time

import time

start_time = time.time()
categorical_cols = [col for col in df1.columns if len(col.split('_')) > 1]
numerical_cols = df1[df1.columns.difference(categorical_cols)].columns
unique_groups = df1['A_cat'].unique().tolist() # returns [ind, sa]

for cat, num in zip(categorical_cols, numerical_cols):
    for group in unique_groups:
        df2.loc[df2[cat] == group, cat] = df1.groupby(cat)[num].mean().loc[group]

print("Elapsed time:", time.time() - start)

Upvotes: 1

Related Questions