How to group values over different columns

Question

I have the following data frame:

df =
    ID    HOUR    GROUP_1     GROUP_2     GROUP_3     DURATION
    1     7       AAA         AAA         BBB         20
    2     7       BBB         AAA         CCC         22
    3     7       BBB         BBB         BBB         21
    4     8       AAA         AAA         AAA         23
    5     8       CCC         AAA         CCC         25
    6     9       CCC         CCC         CCC         28

I can calculate average DURATION per HOUR as follows:

grouped = df.groupby("HOUR").DURATION.mean().reset_index()

Now I need to also group the entried by group values stored in GROUP_1, GROUP_2 and GROUP_3. I want to consider the occurance of a group only once per row, i.e. if AAA is repeated two times in the first row, it should be considered only once.

The result should be:

result =
GROUP  HOUR  MEAN_DURATION
AAA    7     21
AAA    8     24
BBB    7     21
...

I know how to count the occurance of each group per row, but don't know how to put everything together to get the expected result:

df.filter(regex="^GROUP").stack().reset_index(level=1, drop=True).reset_index().drop_duplicates()[0].value_counts()

akuiper · Accepted Answer

You can transform your group variables to one column, drop duplicated groups in each row and then group by hour and group to take the mean:

(pd.melt(df, id_vars=['ID', 'HOUR', 'DURATION'], value_name='GROUP')
 .drop('variable', axis=1).drop_duplicates()
 .groupby(['HOUR', 'GROUP']).DURATION.mean()
 .reset_index())

How to group values over different columns

Answers (2)

Related Questions