Reputation: 511
This is my code:
frst_df = df.drop(columns=["Comment"]).groupby(['source'], as_index=False).agg('first')
cmnt_df = df.groupby(['source'], as_index=False)['Comment'].apply(', '.join)
merge_df = pd.merge(frst_df, cmnt_df , on='source')
I hope it is understandable what I'm trying to do here.
I have a large dataframe where I have a column 'source'. This is the primary column of the dataframe. Now for the column 'Comment', I want to join all comments corresponding to the value of the 'source'. There are approx 50 other columns in the dataframe. I want to pick only the first element from all the values corresponding to the 'source'.
The code I wrote works fine, but the dataframe is huge and it takes lots of time to create two separate dataframes and then merge them. Is there any better way to do this?
Upvotes: 3
Views: 1038
Reputation: 8768
This is another possible solution.
df['Comment'] = df.groupby('source')['Comment'].transform(lambda x: ','.join(x))
df = df.groupby('source').first()
Upvotes: 1
Reputation: 862511
You can use GroupBy.agg
by dictionary - all columns are aggregate by first only Comment
by join
:
df = pd.DataFrame({
'Comment':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'source':list('aaabbc')
})
d = dict.fromkeys(df.columns.difference(['source']), 'first')
d['Comment'] = ', '.join
merge_df = df.groupby('source', as_index=False).agg(d)
print (merge_df)
source B C Comment D E
0 a 4 7 a, b, c 1 5
1 b 5 4 d, e 7 9
2 c 4 3 f 0 4
Upvotes: 1