Asif Iqbal
Asif Iqbal

Reputation: 511

Groupby on a column and apply function on another column but keep first element of all other columns of dataframe

This is my code:

frst_df = df.drop(columns=["Comment"]).groupby(['source'], as_index=False).agg('first')
cmnt_df = df.groupby(['source'], as_index=False)['Comment'].apply(', '.join)
merge_df = pd.merge(frst_df, cmnt_df , on='source')

I hope it is understandable what I'm trying to do here.

I have a large dataframe where I have a column 'source'. This is the primary column of the dataframe. Now for the column 'Comment', I want to join all comments corresponding to the value of the 'source'. There are approx 50 other columns in the dataframe. I want to pick only the first element from all the values corresponding to the 'source'.

The code I wrote works fine, but the dataframe is huge and it takes lots of time to create two separate dataframes and then merge them. Is there any better way to do this?

Upvotes: 3

Views: 1038

Answers (2)

rhug123
rhug123

Reputation: 8768

This is another possible solution.

df['Comment'] = df.groupby('source')['Comment'].transform(lambda x: ','.join(x))
df = df.groupby('source').first()

Upvotes: 1

jezrael
jezrael

Reputation: 862511

You can use GroupBy.agg by dictionary - all columns are aggregate by first only Comment by join:

df = pd.DataFrame({
        'Comment':list('abcdef'),
         'B':[4,5,4,5,5,4],
         'C':[7,8,9,4,2,3],
         'D':[1,3,5,7,1,0],
         'E':[5,3,6,9,2,4],
         'source':list('aaabbc')
})

d = dict.fromkeys(df.columns.difference(['source']), 'first')
d['Comment'] = ', '.join

merge_df = df.groupby('source', as_index=False).agg(d)
print (merge_df)
  source  B  C  Comment  D  E
0      a  4  7  a, b, c  1  5
1      b  5  4     d, e  7  9
2      c  4  3        f  0  4

Upvotes: 1

Related Questions