Reputation: 8628
I have the following data frame:
df =
ID HOUR GROUP_1 GROUP_2 GROUP_3 DURATION
1 7 AAA AAA BBB 20
2 7 BBB AAA CCC 22
3 7 BBB BBB BBB 21
4 8 AAA AAA AAA 23
5 8 CCC AAA CCC 25
6 9 CCC CCC CCC 28
I can calculate average DURATION
per HOUR
as follows:
grouped = df.groupby("HOUR").DURATION.mean().reset_index()
Now I need to also group the entried by group values stored in GROUP_1
, GROUP_2
and GROUP_3
. I want to consider the occurance of a group only once per row, i.e. if AAA
is repeated two times in the first row, it should be considered only once.
The result should be:
result =
GROUP HOUR MEAN_DURATION
AAA 7 21
AAA 8 24
BBB 7 21
...
I know how to count the occurance of each group per row, but don't know how to put everything together to get the expected result:
df.filter(regex="^GROUP").stack().reset_index(level=1, drop=True).reset_index().drop_duplicates()[0].value_counts()
Upvotes: 0
Views: 49
Reputation: 294218
Using to numpy
to melt
groups = df.filter(like='GROUP').values
w = groups.shape[1]
hours = df.HOUR.values.repeat(w)
durtn = df.DURATION.values.repeat(w)
d1 = pd.DataFrame(
dict(HOUR=hours, GROUP=groups.ravel(), DURATION=durtn)
).drop_duplicates()
d1
DURATION GROUP HOUR
0 20 AAA 7
2 20 BBB 7
3 22 BBB 7
4 22 AAA 7
5 22 CCC 7
6 21 BBB 7
9 23 AAA 8
12 25 CCC 8
13 25 AAA 8
15 28 CCC 9
then groupby
+ mean
+ reset_index
d1.groupby(['HOUR', 'GROUP']).mean().reset_index()
HOUR GROUP DURATION
0 7 AAA 21
1 7 BBB 21
2 7 CCC 22
3 8 AAA 24
4 8 CCC 25
5 9 CCC 28
all together
groups = df.filter(like='GROUP').values
w = groups.shape[1]
hours = df.HOUR.values.repeat(w)
durtn = df.DURATION.values.repeat(w)
pd.DataFrame(
dict(HOUR=hours, GROUP=groups.ravel(), DURATION=durtn)
).drop_duplicates().groupby(
['HOUR', 'GROUP']
).mean().reset_index()
Upvotes: 0
Reputation: 214927
You can transform your group variables to one column, drop duplicated groups in each row and then group by hour
and group
to take the mean:
(pd.melt(df, id_vars=['ID', 'HOUR', 'DURATION'], value_name='GROUP')
.drop('variable', axis=1).drop_duplicates()
.groupby(['HOUR', 'GROUP']).DURATION.mean()
.reset_index())
Upvotes: 3