Reputation: 87
I have a pandas dataframe with three columns:
a b c
Donaldson Minnesota 2020
Ozuna Atlanta 2020
Betts Boston 2019
Donaldson Atlanta 2019
Ozuna St. Louis 2019
Torres New York 2019
I want to identify all column names that have more than one column c value, and then replace all column b instances with the first value in the dataframe like this:
a b c
Donaldson Minnesota 2020
Ozuna Atlanta 2020
Betts Boston 2019
Donaldson Minnesota 2019
Ozuna Atlanta 2019
Torres New York 2019
This is definitely inefficient, but here's what I tried so far:
# get a df of just names and cities and deduplicate
df_names = df[['a','b']].drop_duplicates()
# find any multiple column b values and put them in a list
a_matches = pd.Dataframe(df_names.groupby('a')['b'].nunique())
multi_b = a_matches.index[a_matches['b'] > 1].tolist()
This gives me ['Donaldson','Ozuna'], but now I am stuck. I can't come up with a good way to generate a replacement dictionary for their corresponding values in c. I think there must be a more elegant way to get to this.
Upvotes: 3
Views: 49
Reputation: 75120
IIUC, you can try with groupby+transform
with np.where
:
g = df.groupby('a')
c = g['c'].transform('nunique').gt(1) # column a names that have >1 column c value
df['b'] = np.where(c,g['b'].transform('first'),df['b'])
# for a new df: new = df.assign(b=np.where(c,g['b'].transform('first'),df['b']))
print(df)
a b c
0 Donaldson Minnesota 2020
1 Ozuna Atlanta 2020
2 Betts Boston 2019
3 Donaldson Minnesota 2019
4 Ozuna Atlanta 2019
5 Torres New York 2019
For the given example as @ALloz correctly pointed , you can just use:
df['b'] = df.groupby('a')['b'].transform('first')
print(df)
a b c
0 Donaldson Minnesota 2020
1 Ozuna Atlanta 2020
2 Betts Boston 2019
3 Donaldson Minnesota 2019
4 Ozuna Atlanta 2019
5 Torres New York 2019
Upvotes: 5