Reputation: 1
I have an issue with dataframe in pandas. I have a dataset of patients with their ids named stroke002,stroke003,stroke004, etc.. Every patient has gave 3 samples, so in dataframe for each patient stands 3 rows (based on number of sample, for example: stroke002-1,stroke002-2,stroke003-3,etc...). What I wanna do is to create a new dataframe in which I will only have single rows for each patient with average values from columns for all 3 samples. I dont have any ideas. I was thinking about nested loops with conditions, but I don't know how should I start.
Upvotes: 0
Views: 53
Reputation: 94
if I understand you correctly, you have a table that looks somehow like this:
import pandas as pd
df = pd.DataFrame(
[
["stroke002-1", 5, 0],
["stroke002-2", 2, 4],
["stroke002-3", 7, 3],
["stroke001-1", 5, 1],
["stroke001-2", 1, 2],
["stroke001-1", 3, 1],
], columns=["ids", "values1", "values2"]
)
ids values1 values2
0 stroke002-1 5 0
1 stroke002-2 2 4
2 stroke002-3 7 3
3 stroke001-1 5 1
4 stroke001-2 1 2
5 stroke001-1 3 1
First, you can remove the subscript in the ids
and keep only the first part of id before dash (i.e. stroke002 instead of stroke002-1) to be able to group the patients:
df["clean_ids"] = df["ids"].apply(lambda x: x.split("-")[0])
and then you can group by the newly created clean_ids
column to obtain averages of values across patients:
df.groupby("clean_ids")[["values1", "values2"]].mean()
which results into:
values1 values2
clean_ids
stroke001 3.000000 1.333333
stroke002 4.666667 2.333333
Upvotes: 1