Reputation: 25366
I have a data frame my_df
, then I want to create a new data frame new_df
. Each new_df
column is created by groupby
my_id
then take the max
of a column in my_df
.
Below is my code and it works fine. However, I am wondering is there any better approach? Especially in the future I will be dealing with hundreds of columns instead of just 6 columns? Thanks a lot!
tmp_df1 = my_df.groupby(['my_id'], as_index=False).col_A.agg({"max_A": "max"})
tmp_df2 = my_df.groupby(['my_id'], as_index=False).col_B.agg({"max_B": "max"})
tmp_df3 = my_df.groupby(['my_id'], as_index=False).col_C.agg({"max_C": "max"})
tmp_df4 = my_df.groupby(['my_id'], as_index=False).col_D.agg({"max_D": "max"})
tmp_df5 = my_df.groupby(['my_id'], as_index=False).col_E.agg({"max_E": "max"})
tmp_df6 = my_df.groupby(['my_id'], as_index=False).col_F.agg({"max_F": "max"})
combine_df1 = pd.merge(tmp_df1,tmp_df2,how="inner",on=['my_id'])
combine_df2 = pd.merge(combine_df1,tmp_df3,how="inner",on=['my_id'])
combine_df3 = pd.merge(combine_df2,tmp_df4,how="inner",on=['my_id'])
combine_df4 = pd.merge(combine_df3,tmp_df5,how="inner",on=['my_id'])
new_df = pd.merge(combine_df4,tmp_df6,how="inner",on=['my_id'])
Upvotes: 1
Views: 58
Reputation: 294458
Consider the example dataframe my_df
np.random.seed([3,1415])
my_df = pd.DataFrame(dict(
my_id=list('WXYZ') * 4,
col_A=np.random.randint(10, size=16),
col_B=np.random.randint(10, size=16),
col_C=np.random.randint(10, size=16),
col_D=np.random.randint(10, size=16),
col_E=np.random.randint(10, size=16),
col_F=np.random.randint(10, size=16),
))
my_df
col_A col_B col_C col_D col_E col_F my_id
0 0 3 6 1 9 3 W
1 2 2 4 5 8 5 X
2 7 4 7 2 0 8 Y
3 3 3 6 8 4 5 Z
4 8 3 2 2 3 1 W
5 7 6 6 4 0 5 X
6 0 7 6 7 4 1 Y
7 6 7 5 6 1 4 Z
8 8 4 2 9 5 3 W
9 6 5 8 4 8 9 X
10 0 3 7 2 6 5 Y
11 2 7 5 4 0 5 Z
12 0 5 8 6 8 7 W
13 4 9 4 3 7 0 X
14 9 8 7 8 4 3 Y
15 7 7 6 3 6 2 Z
You just want to use groupby
and max
new_df = my_df.groupby('my_id', as_index=False).max()
new_df
my_id col_A col_B col_C col_D col_E col_F
0 W 8 5 8 9 9 7
1 X 7 9 8 5 8 9
2 Y 9 8 7 8 6 8
3 Z 7 7 6 8 6 5
If you want to change the names of the columns:
my_df.groupby('my_id').max() \
.rename(columns=lambda x: x.replace('col_', '')) \
.add_suffix('_max').reset_index()
my_id A_max B_max C_max D_max E_max F_max
0 W 8 5 8 9 9 7
1 X 7 9 8 5 8 9
2 Y 9 8 7 8 6 8
3 Z 7 7 6 8 6 5
Or
my_df.groupby('my_id').max() \
.rename(columns=lambda x: x.rsplit('_', 1)[1] + '_max').reset_index()
my_id A_max B_max C_max D_max E_max F_max
0 W 8 5 8 9 9 7
1 X 7 9 8 5 8 9
2 Y 9 8 7 8 6 8
3 Z 7 7 6 8 6 5
Upvotes: 1