user5576
user5576

Reputation: 129

Get order of subgroups in pandas dataframe

I have a pandas dataframe that looks something like this:

df = pd.DataFrame({'Name' : ['Kate', 'John', 'Peter','Kate', 'John', 'Peter'],'Distance' : [23,16,32,15,31,26], 'Time' : [3,5,2,7,9,4]})
df


  Distance  Name   Time
0   23      Kate    3
1   16      John    5
2   32      Peter   2
3   15      Kate    7
4   31      John    9
5   26      Peter   2

I want to add a column that tells me, for each Name, what's the order of the times.

I want something like this:

 Order  Distance    Name    Time
    0       16      John    5
    1       31      John    9
    0       23      Kate    3
    1       15      Kate    7
    0       32      Peter   2
    1       26      Peter   4

I can do it using a for loop:

df2 = df[df['Name'] == 'aaa'].reset_index().reset_index() # I did this just to create an empty data frame with the columns I want

for name, row in df.groupby('Name').count().iterrows():
    table = df[df['Name'] == name].sort_values('Time').reset_index().reset_index()
    to_concat = [df2,table]
    df2 = pd.concat(to_concat)

df2.drop('index', axis = 1, inplace = True)
df2.columns = ['Order', 'Distance', 'Name', 'Time']
df2

This works, the problem is (apart from being very unpythonic), for large tables (my actual table has about 50 thousand rows) it takes about half an hour to run.

Can someone help me write this in a simpler way that runs faster? I'm sorry if this has been answered somewhere, but I didn't really know how to search for it.

Best,

Upvotes: 2

Views: 1065

Answers (2)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210832

In [67]: df = df.sort_values(['Name','Time']) \
                .assign(Order=df.groupby('Name').cumcount())

In [68]: df
Out[68]:
   Distance   Name  Time  Order
1        16   John     5      0
4        31   John     9      1
0        23   Kate     3      0
3        15   Kate     7      1
2        32  Peter     2      0
5        26  Peter     4      1

PS I'm not sure this is the most elegant way to do this...

Upvotes: 2

jezrael
jezrael

Reputation: 862481

Use sort_values with cumcount:

df = df.sort_values(['Name','Time'])
df['Order'] = df.groupby('Name').cumcount()
print (df)

   Distance   Name  Time  Order
1        16   John     5      0
4        31   John     9      1
0        23   Kate     3      0
3        15   Kate     7      1
2        32  Peter     2      0
5        26  Peter     4      1

If need first column use insert:

df = df.sort_values(['Name','Time'])
df.insert(0, 'Order', df.groupby('Name').cumcount())
print (df)
   Order  Distance   Name  Time
1      0        16   John     5
4      1        31   John     9
0      0        23   Kate     3
3      1        15   Kate     7
2      0        32  Peter     2
5      1        26  Peter     4

Upvotes: 2

Related Questions