Dinosaurius
Dinosaurius

Reputation: 8628

How to group values over different columns

I have the following data frame:

df =
    ID    HOUR    GROUP_1     GROUP_2     GROUP_3     DURATION
    1     7       AAA         AAA         BBB         20
    2     7       BBB         AAA         CCC         22
    3     7       BBB         BBB         BBB         21
    4     8       AAA         AAA         AAA         23
    5     8       CCC         AAA         CCC         25
    6     9       CCC         CCC         CCC         28

I can calculate average DURATION per HOUR as follows:

grouped = df.groupby("HOUR").DURATION.mean().reset_index()

Now I need to also group the entried by group values stored in GROUP_1, GROUP_2 and GROUP_3. I want to consider the occurance of a group only once per row, i.e. if AAA is repeated two times in the first row, it should be considered only once.

The result should be:

result =
GROUP  HOUR  MEAN_DURATION
AAA    7     21
AAA    8     24
BBB    7     21
...

I know how to count the occurance of each group per row, but don't know how to put everything together to get the expected result:

df.filter(regex="^GROUP").stack().reset_index(level=1, drop=True).reset_index().drop_duplicates()[0].value_counts()

Upvotes: 0

Views: 49

Answers (2)

piRSquared
piRSquared

Reputation: 294218

Using to numpy to melt

groups = df.filter(like='GROUP').values

w = groups.shape[1]
hours = df.HOUR.values.repeat(w)
durtn = df.DURATION.values.repeat(w)

d1 = pd.DataFrame(
    dict(HOUR=hours, GROUP=groups.ravel(), DURATION=durtn)
).drop_duplicates()

d1

    DURATION GROUP  HOUR
0         20   AAA     7
2         20   BBB     7
3         22   BBB     7
4         22   AAA     7
5         22   CCC     7
6         21   BBB     7
9         23   AAA     8
12        25   CCC     8
13        25   AAA     8
15        28   CCC     9

then groupby + mean + reset_index

d1.groupby(['HOUR', 'GROUP']).mean().reset_index()

   HOUR GROUP  DURATION
0     7   AAA        21
1     7   BBB        21
2     7   CCC        22
3     8   AAA        24
4     8   CCC        25
5     9   CCC        28

all together

groups = df.filter(like='GROUP').values

w = groups.shape[1]
hours = df.HOUR.values.repeat(w)
durtn = df.DURATION.values.repeat(w)

pd.DataFrame(
    dict(HOUR=hours, GROUP=groups.ravel(), DURATION=durtn)
).drop_duplicates().groupby(
    ['HOUR', 'GROUP']
).mean().reset_index()

Upvotes: 0

akuiper
akuiper

Reputation: 214927

You can transform your group variables to one column, drop duplicated groups in each row and then group by hour and group to take the mean:

(pd.melt(df, id_vars=['ID', 'HOUR', 'DURATION'], value_name='GROUP')
 .drop('variable', axis=1).drop_duplicates()
 .groupby(['HOUR', 'GROUP']).DURATION.mean()
 .reset_index())

enter image description here

Upvotes: 3

Related Questions