Reputation: 394
The problem I'm having doesn't seem too complex, but I'm struggling to find a good way to go about it. All I need to do is create a new dataframe from the data below that sums up the 'Minutes' column for each ID appearing in 'P1'-'P5'. The only method I've come up with would be to create 5 different df's (one for each 'P' column), append them, and then use groupby to get the minute totals, but that seems wildly inefficient.
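Roughly, the approach I was thinking of looks like this (an untested sketch, assuming the data is in a DataFrame called df):
import pandas as pd

# Build one small frame per 'P' column, rename each to a common 'P' column,
# stack them, then group to get the total minutes per ID.
parts = [df[[col, 'Minutes']].rename(columns={col: 'P'})
         for col in ['P1', 'P2', 'P3', 'P4', 'P5']]
totals = pd.concat(parts).groupby('P', as_index=False)['Minutes'].sum()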
Any ideas on a better way to approach this would be greatly appreciated.
Original:
P1 P2 P3 P4 P5 Minutes
1627759 1628369 201143 202330 202681 7
1626179 1628369 202330 203382 203935 5
1627759 1628369 201143 202681 203382 5
1627759 201143 202681 202694 203935 5
1626179 1628369 201143 202330 203935 4
1626179 1627759 202681 202694 203382 4
1626179 1628369 202694 203382 203935 3
1626179 1627759 201143 202681 202694 2
1626179 1628464 202330 202694 203935 2
1627759 201143 202330 202681 202694 2
1628369 201143 202330 202681 203935 2
1626179 1627759 1628369 201143 202681 1
1626179 202330 202694 203382 203935 1
1626179 1627759 201143 202694 203935 1
1626179 1627824 1628400 1628464 202954 1
1626179 1628369 1628464 202330 203935 1
1626179 1627824 1628400 1628464 203935 1
1627759 202330 202681 202694 203382 0
Expected outcome:
P Minutes
1627759 27
1626179 26
1628369 28
201143 29
1628464 5
202330 24
1627824 2
202681 28
202694 20
1628400 2
203382 18
203935 25
202954 1
Upvotes: 3
Views: 113
Reputation: 323276
Using wide_to_long. The index is reset and included in i so that the id columns uniquely identify each row (Minutes alone contains duplicates):
import pandas as pd

# Reset the index so ['Minutes', 'index'] uniquely identifies each row, unpivot
# the P1-P5 columns into a single 'P' column, then total the minutes per ID.
(pd.wide_to_long(df.reset_index(), ['P'], i=['Minutes', 'index'], j='drop')
   .reset_index(level=0)
   .groupby('P')
   .Minutes
   .sum())
P
201143 29
202330 24
202681 28
202694 20
202954 1
203382 18
203935 25
1626179 26
1627759 27
1627824 2
1628369 28
1628400 2
1628464 5
Name: Minutes, dtype: int64
Upvotes: 1
Reputation: 57033
I am not sure if melting the dataframe is more efficient, but at least it does not involve explicit loops:
# Unpivot P1-P5 into a single 'value' column, then sum Minutes per ID
(pd.melt(df, id_vars='Minutes', value_vars=df.columns[:-1])
   .groupby('value')['Minutes'].sum())
This solution "stretches" the first five columns into a tall one and uses it for grouping. According to %%timeit
, this solution is twice as fast as yours (for the given dataframe).
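For example, the timing can be run in a Jupyter cell roughly like this (illustrative only; df holds the example data):
%%timeit
pd.melt(df, id_vars='Minutes', value_vars=df.columns[:-1])\
    .groupby('value')['Minutes'].sum()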
Upvotes: 4