In pandas, how can I identify records that share a common value and replace the value of one of them to match the other?

Question

I have a pandas dataframe with three columns:

a          b          c
Donaldson  Minnesota  2020
Ozuna      Atlanta    2020
Betts      Boston     2019
Donaldson  Atlanta    2019
Ozuna      St. Louis  2019
Torres     New York   2019

I want to identify all column names that have more than one column c value, and then replace all column b instances with the first value in the dataframe like this:

a          b          c
Donaldson  Minnesota  2020
Ozuna      Atlanta    2020
Betts      Boston     2019
Donaldson  Minnesota  2019
Ozuna      Atlanta    2019
Torres     New York   2019

This is definitely inefficient, but here's what I tried so far:

# get a df of just names and cities and deduplicate

df_names = df[['a','b']].drop_duplicates()


# find any multiple column b values and put them in a list

a_matches = pd.Dataframe(df_names.groupby('a')['b'].nunique())
multi_b = a_matches.index[a_matches['b'] > 1].tolist()

This gives me ['Donaldson','Ozuna'], but now I am stuck. I can't come up with a good way to generate a replacement dictionary for their corresponding values in c. I think there must be a more elegant way to get to this.

anky · Accepted Answer

IIUC, you can try with groupby+transform with np.where:

g = df.groupby('a')
c = g['c'].transform('nunique').gt(1) # column a names that have >1 column c value
df['b'] = np.where(c,g['b'].transform('first'),df['b'])
# for a new df: new = df.assign(b=np.where(c,g['b'].transform('first'),df['b']))

print(df)

         a          b     c
0  Donaldson  Minnesota  2020
1      Ozuna    Atlanta  2020
2      Betts     Boston  2019
3  Donaldson  Minnesota  2019
4      Ozuna    Atlanta  2019
5     Torres   New York  2019

For the given example as @ALloz correctly pointed , you can just use:

df['b'] = df.groupby('a')['b'].transform('first')
print(df)

           a          b     c
0  Donaldson  Minnesota  2020
1      Ozuna    Atlanta  2020
2      Betts     Boston  2019
3  Donaldson  Minnesota  2019
4      Ozuna    Atlanta  2019
5     Torres   New York  2019

In pandas, how can I identify records that share a common value and replace the value of one of them to match the other?

Answers (1)

Related Questions