Reputation: 697
I have a data frame of 370000 rows x 340 columns, in this DataFrame I have mixed numeric and non-numeric values between columns.
I'd like to merge rows, basing on user_id
, that certain numeric values are summed, and non-numeric values are preserved.
Given table:
user_id points gender age
11111 10000 male 18
2222 4200 female 33
11111 30000 male 18
2222 10 female 33
11111 20000 male 18
I'd expect this result:
user_id points gender age
11111 60000 male 18
2222 4210 female 33
While summing rows, I need to sum approx. 300 values out of 340, but I'm new to pandas struggling to find the right way to do this, would appreciate any tips.
Upvotes: 1
Views: 90
Reputation: 59569
You want to group your original DataFrame
by 'user_id'
. Then you can aggregate the grouped frame and specify a dictionary for which function to use on each column.
df.groupby('user_id').agg({'points': 'sum',
'gender': 'first',
'age': 'first'}).reset_index()
Outputs:
user_id points gender age
0 2222 4210 female 33
1 11111 60000 male 18
First will work as long as there is only one unique value in the columns where you want to 'preserve' the value. Otherwise implement mode to keep the most common value, or 'unique' to get a list of all values.
Upvotes: 3
Reputation: 846
Hey you can do that using pandas aggregate sum
like
df.groupby(['user_id'])[["points"]].sum()
df.groupby(['user_id'])[["list_of_attribute"]].sum()
And its also possible to pass multiple columns
in list_of_attributes
but that would have to subset
of columns otherwise KeyError
.
Upvotes: 1