Armon

Reputation: 113

Pandas DataFrame groupby based on condition

The most similar question I found was here but with no proper answer.

Basically, I'm trying to use groupby on a dataframe to generate unique IDs for bus routes. The problem is that the data at my disposal sometimes (though rarely) has the same values for my groupby columns, so different buses end up being considered the same bus even though they aren't.

The only other way I can think of is to group buses based on another column called "Type of stop", which contains an indicator for Start, Middle and End. I'd like to use groupby to create groups based on this column, where each group starts where "Type of stop" = Start and ends where "Type of stop" = End.

Consider the following data:

import pandas as pd

df = pd.DataFrame({'Vehicle_ID': ['A']*18,
                   'Position': ['START', 'MID', 'MID', 'END', 'MID', 'START']*3})

   Vehicle_ID Position
0           A    START
1           A      MID
2           A      MID
3           A      END
4           A      MID
5           A    START
6           A    START
7           A      MID
8           A      MID
9           A      END
10          A      MID
11          A    START
12          A    START
13          A      MID
14          A      MID
15          A      END
16          A      MID
17          A    START

The only way I came up with to accurately group these buses together is to generate an additional column with the bus sequence id (a rough sketch of that workaround is shown after the expected output below), but given that I'm working with lots of data, this isn't a very efficient solution. I'm hoping to manage what I want with a single groupby, if possible, in order to generate the following output:

   Vehicle_ID Position  Group
0           A    START      1
1           A      MID      1
2           A      MID      1
3           A      END      1
4           A      MID
5           A    START      2
6           A    START      2
7           A      MID      2
8           A      MID      2
9           A      END      2
10          A      MID
11          A    START      3
12          A    START      3
13          A      MID      3
14          A      MID      3
15          A      END      3
16          A      MID
17          A    START      4
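
For reference, here is a rough, loop-based sketch of the "additional sequence id column" workaround mentioned above (an assumption of what it looks like; it only uses the df defined earlier):

# Loop-based sketch of building the extra "Group" column (assumed form of the workaround).
# Rows outside a START..END window get no group, matching the blanks above.
group_ids = []
group = 0
inside = False
for pos in df['Position']:
    if pos == 'START' and not inside:
        group += 1                    # a new bus trip begins
        inside = True
        group_ids.append(group)
    elif pos == 'END' and inside:
        group_ids.append(group)
        inside = False                # the trip is complete
    elif inside:
        group_ids.append(group)       # MID or repeated START inside a trip
    else:
        group_ids.append(None)        # stray rows between trips
df['Group'] = group_ids

This produces the grouping above, but it iterates row by row in Python, which is exactly what I'd like to avoid.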

Upvotes: 2

Views: 256

Answers (2)

jpp

Reputation: 164613

One idea is to factorize via np.select, then use a custom loop via numba:

import numpy as np
import pandas as pd
from numba import njit

df = pd.DataFrame({'Vehicle_ID': ['A']*18,
                   'Position': ['START', 'MID', 'MID', 'END', 'MID', 'START']*3})

@njit
def grouper(pos):
    # pos encoding (see np.select below): 0 = START, 1 = END, -1 = MID
    res = np.empty(pos.shape)
    num = 1        # current group number
    started = 0    # 1 while inside a START..END window
    for i in range(len(res)):
        current_pos = pos[i]
        if (started == 0) and (current_pos == 0):
            # a START outside a window opens a new group
            started = 1
            res[i] = num
        elif (started == 1) and (current_pos == 1):
            # an END closes the current group
            started = 0
            res[i] = num
            num += 1
        elif (started == 1) and (current_pos in [-1, 0]):
            # MID or repeated START inside an open window
            res[i] = num
        else:
            # rows outside any window get group 0
            res[i] = 0
    return res

# Encode positions: START -> 0, END -> 1, everything else (MID) -> -1
arr = np.select([df['Position'].eq('START'), df['Position'].eq('END')], [0, 1], -1)

df['Group'] = grouper(arr).astype(int)

Result:

print(df)

   Position Vehicle_ID  Group
0     START          A      1
1       MID          A      1
2       MID          A      1
3       END          A      1
4       MID          A      0
5     START          A      2
6     START          A      2
7       MID          A      2
8       MID          A      2
9       END          A      2
10      MID          A      0
11    START          A      3
12    START          A      3
13      MID          A      3
14      MID          A      3
15      END          A      3
16      MID          A      0
17    START          A      4

In my opinion, you should not include "blank" values, as this would force your series to be object dtype, which is inefficient for any subsequent processing. As above, you can use 0 instead.
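
For example (a minimal illustration, not part of the original answer), compare the dtypes:

import pandas as pd

# Mixing numbers with blank strings forces object dtype; 0 keeps the column numeric.
print(pd.Series([1, 1, '', 2]).dtype)   # object
print(pd.Series([1, 1, 0, 2]).dtype)    # int64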

Performance benchmarking

numba is around 10x faster than the pure Pandas approach below:

import pandas as pd, numpy as np
from numba import njit

df = pd.DataFrame({'Vehicle_ID': ['A']*18,
                   'Position': ['START', 'MID', 'MID', 'END', 'MID', 'START']*3})


df = pd.concat([df]*10, ignore_index=True)

assert joz(df.copy()).equals(jpp(df.copy()))

%timeit joz(df.copy())  # 18.6 ms per loop
%timeit jpp(df.copy())  # 1.95 ms per loop

Benchmarking functions:

def joz(df):
    # identification of sequences
    df['Position_Prev'] = df['Position'].shift(1)
    df['Sequence'] = 0
    df.loc[(df['Position'] == 'START') & (df['Position_Prev'] != 'START'), 'Sequence'] = 1
    df.loc[df['Position'] == 'END', 'Sequence'] = -1
    df['Sequence_Sum'] = df['Sequence'].cumsum()
    df.loc[df['Sequence'] == -1, 'Sequence_Sum'] = 1

    # take only items between START and END and generate Group number
    df2 = df[df['Sequence_Sum'] == 1].copy()
    df2.loc[df['Sequence'] == -1, 'Sequence'] = 0
    df2['Group'] = df2['Sequence'].cumsum()

    # merge results to one dataframe
    df = df.merge(df2[['Group']], left_index=True, right_index=True, how='left')
    df['Group'] = df['Group'].fillna(0)
    df['Group'] = df['Group'].astype(int)
    df.drop(['Position_Prev', 'Sequence', 'Sequence_Sum'], axis=1, inplace=True)    
    return df

@njit
def grouper(pos):
    res = np.empty(pos.shape)
    num = 1
    started = 0
    for i in range(len(res)):
        current_pos = pos[i]
        if (started == 0) and (current_pos == 0):
            started = 1
            res[i] = num
        elif (started == 1) and (current_pos == 1):
            started = 0
            res[i] = num
            num += 1
        elif (started == 1) and (current_pos in [-1, 0]):
            res[i] = num
        else:
            res[i] = 0
    return res

def jpp(df):
    arr = np.select([df['Position'].eq('START'), df['Position'].eq('END')], [0, 1], -1)
    df['Group'] = grouper(arr).astype(int)
    return df

Upvotes: 1

y0j0

Reputation: 3592

I have a solution. You have to avoid loops and instead use shifting, slicing and merging.

This is my first prototype (it should be refactored):

# identification of sequences
df['Position_Prev'] = df['Position'].shift(1)
df['Sequence'] = 0
df.loc[(df['Position'] == 'START') & (df['Position_Prev'] != 'START'), 'Sequence'] = 1
df.loc[df['Position'] == 'END', 'Sequence'] = -1
df['Sequence_Sum'] = df['Sequence'].cumsum()
df.loc[df['Sequence'] == -1, 'Sequence_Sum'] = 1

# take only items between START and END and generate Group number
df2 = df[df['Sequence_Sum'] == 1].copy()
df2.loc[df['Sequence'] == -1, 'Sequence'] = 0
df2['Group'] = df2['Sequence'].cumsum()

# merge results to one dataframe
df = df.merge(df2[['Group']], left_index=True, right_index=True, how='left')
df['Group'] = df['Group'].fillna(0)
df['Group'] = df['Group'].astype(int)
df.drop(columns=['Position_Prev', 'Sequence', 'Sequence_Sum'], inplace=True)
df

Result:

   Vehicle_ID Position  Group
0           A    START      1
1           A      MID      1
2           A      MID      1
3           A      END      1
4           A      MID      0
5           A    START      2
6           A    START      2
7           A      MID      2
8           A      MID      2
9           A      END      2
10          A      MID      0
11          A    START      3
12          A    START      3
13          A      MID      3
14          A      MID      3
15          A      END      3
16          A      MID      0
17          A    START      4

Upvotes: 1
