JohnJohn45411
JohnJohn45411

Reputation: 21

Python Pandas GroupBy until Value change

Suppose I have a dataset with three columns: Name, Application and Duration.

I am trying to figure how to group by Name and Application, where a different hop to another application will end the current grouping and start a new one, and if I return to the original application, it will count it as a new grouping

The image here illustrates the table

enter image description here

My desired output would be:

1. John, Excel, 5 mins
2. John, Spotify, 1 mins
3. John, Excel, 1 mins
4. John, Spotify, 2 mins
5. Emily, Excel, 5 mins 
6. John, Excel, 3 mins

I have been attempting to do this in Pandas but I cannot manage to ensure that it aggregates by different application hops, even if it comes back to a previous application.

Upvotes: 2

Views: 1777

Answers (2)

Chris Sears
Chris Sears

Reputation: 6802

You can use Pandas .shift() to compare the values of the series with the next row, build up a session value based on the "hops", and then group by that session value.

import pandas as pd

df = pd.DataFrame({
    'name' : ['John', 'John', 'John',   'John', 'John',   'Emily', 'Emily', 'John'],
    'app'  : ['Excel','Excel','Spotify','Excel','Spotify','Excel', 'Excel', 'Excel'], 
    'duration':[3,2,1,1,2,4,1,3]})

session = ((df.name != df.name.shift()) | (df.app != df.app.shift())).cumsum()
df2 = df.groupby(['name', 'app', session], as_index=False, sort=False)['duration'].sum()
print(df2)

Output:

    name      app  duration
0   John    Excel         5
1   John  Spotify         1
2   John    Excel         1
3   John  Spotify         2
4  Emily    Excel         5
5   John    Excel         3

Upvotes: 3

SimulationsWithBob
SimulationsWithBob

Reputation: 1

One solution would be to add a column to define hops. Then group by that column

hop_id = 1
for i in df.index:
    df.loc[i,'hop_id'] = hop_id
    if (df.loc[i,'Name']!= df.loc[i+1,'Name']) or (df.loc[i,'Application'] != df.loc[i+1,'Application']):
        hop_id = hop_id +1

df.groupby('hop_id')['Duration'].sum()

Upvotes: 0

Related Questions