Reputation: 21
Suppose I have a dataset with three columns: Name, Application and Duration.
I am trying to figure how to group by Name and Application, where a different hop to another application will end the current grouping and start a new one, and if I return to the original application, it will count it as a new grouping
The image here illustrates the table
My desired output would be:
1. John, Excel, 5 mins
2. John, Spotify, 1 mins
3. John, Excel, 1 mins
4. John, Spotify, 2 mins
5. Emily, Excel, 5 mins
6. John, Excel, 3 mins
I have been attempting to do this in Pandas but I cannot manage to ensure that it aggregates by different application hops, even if it comes back to a previous application.
Upvotes: 2
Views: 1777
Reputation: 6802
You can use Pandas .shift()
to compare the values of the series with the next row, build up a session value based on the "hops", and then group by that session value.
import pandas as pd
df = pd.DataFrame({
'name' : ['John', 'John', 'John', 'John', 'John', 'Emily', 'Emily', 'John'],
'app' : ['Excel','Excel','Spotify','Excel','Spotify','Excel', 'Excel', 'Excel'],
'duration':[3,2,1,1,2,4,1,3]})
session = ((df.name != df.name.shift()) | (df.app != df.app.shift())).cumsum()
df2 = df.groupby(['name', 'app', session], as_index=False, sort=False)['duration'].sum()
print(df2)
Output:
name app duration
0 John Excel 5
1 John Spotify 1
2 John Excel 1
3 John Spotify 2
4 Emily Excel 5
5 John Excel 3
Upvotes: 3
Reputation: 1
One solution would be to add a column to define hops. Then group by that column
hop_id = 1
for i in df.index:
df.loc[i,'hop_id'] = hop_id
if (df.loc[i,'Name']!= df.loc[i+1,'Name']) or (df.loc[i,'Application'] != df.loc[i+1,'Application']):
hop_id = hop_id +1
df.groupby('hop_id')['Duration'].sum()
Upvotes: 0