Jakub
Jakub

Reputation: 1

Pandas. Taking average of all columns by rows with condition

I have an issue with dataframe in pandas. I have a dataset of patients with their ids named stroke002,stroke003,stroke004, etc.. Every patient has gave 3 samples, so in dataframe for each patient stands 3 rows (based on number of sample, for example: stroke002-1,stroke002-2,stroke003-3,etc...). What I wanna do is to create a new dataframe in which I will only have single rows for each patient with average values from columns for all 3 samples. I dont have any ideas. I was thinking about nested loops with conditions, but I don't know how should I start.

Dataframe

Upvotes: 0

Views: 53

Answers (1)

artemshramko
artemshramko

Reputation: 94

if I understand you correctly, you have a table that looks somehow like this:

import pandas as pd

df = pd.DataFrame(
  [
    ["stroke002-1", 5, 0],
    ["stroke002-2", 2, 4],
    ["stroke002-3", 7, 3],
    ["stroke001-1", 5, 1],
    ["stroke001-2", 1, 2],
    ["stroke001-1", 3, 1],
  ], columns=["ids", "values1", "values2"]
)
    ids          values1    values2
0   stroke002-1  5          0
1   stroke002-2  2          4
2   stroke002-3  7          3 
3   stroke001-1  5          1
4   stroke001-2  1          2
5   stroke001-1  3          1

First, you can remove the subscript in the ids and keep only the first part of id before dash (i.e. stroke002 instead of stroke002-1) to be able to group the patients:

df["clean_ids"] = df["ids"].apply(lambda x: x.split("-")[0])

and then you can group by the newly created clean_ids column to obtain averages of values across patients:

df.groupby("clean_ids")[["values1", "values2"]].mean()

which results into:

            values1   values2
clean_ids                    
stroke001  3.000000  1.333333
stroke002  4.666667  2.333333

Upvotes: 1

Related Questions