David Ross
David Ross

Reputation: 1167

Pandas: calculating the mean values of duplicate entries in a dataframe

I have been working with a dataframe in python and pandas that contains duplicate entries in the first column. The dataframe looks something like this:

    sample_id    qual    percent
0   sample_1      10        20
1   sample_2      20        30
2   sample_1      50        60
3   sample_2      10        90
4   sample_3      100       20

I want to write something that identifies duplicate entries within the first column and calculates the mean values of the subsequent columns. An ideal output would be something similar to the following:

    sample_id    qual    percent
0   sample_1      30        40
1   sample_2      15        60
2   sample_3      100       20

I have been struggling with this problem all afternoon and would appreciate any help.

Upvotes: 37

Views: 38667

Answers (2)

kinjo
kinjo

Reputation: 66

Groupby will work.

data.groupby('sample_id').mean()

You can then use reset_index() to make look exactly as you want.

Upvotes: 5

piRSquared
piRSquared

Reputation: 294318

groupby the sample_id column and use mean

df.groupby('sample_id').mean().reset_index()
or
df.groupby('sample_id', as_index=False).mean()

get you

enter image description here

Upvotes: 64

Related Questions