Phoenix
Phoenix

Reputation: 73

Sort pandas dataframe column based on substring

I have a pandas dataframe, as shown below:

Timestamp_Start Event_ID Duration
    555.54944   Fix_1   0.42248
    559.07281   Fix_10  0.01996
    559.14642   Fix_11  0
    556.03192   Fix_2   0.16113
    556.27985   Fix_3   0.24188
    556.56097   Fix_4   0.04987
    556.65497   Fix_5   0.10748
    556.80859   Fix_6   0.75708
    557.57983   Fix_7   0.11329
    557.75348   Fix_8   0.65643
    558.43665   Fix_9   0.27447
    555.97925   Sac_1   0.04577
    559.09961   Sac_10  0.0404
    559.15302   Sac_11  0.00726
    556.19916   Sac_2   0.07403
    556.52747   Sac_3   0.02789
    556.61865   Sac_4   0.02985
    556.76849   Sac_5   0.0337
    557.57294   Sac_6   0
    557.69965   Sac_7   0.04687
    558.41632   Sac_8   0.01325
    558.71796   Sac_9   0.34552

I want to sort the 'Event_ID' column, so that Fix_1,Fix_2,Fix_3... and Sac_1,Sac_2,Sac_3... appear in order, like below:

Timestamp_StartEvent_ID Duration
    555.54944   Fix_1   0.42248
    556.03192   Fix_2   0.16113
    556.27985   Fix_3   0.24188
    556.56097   Fix_4   0.04987
    556.65497   Fix_5   0.10748
    556.80859   Fix_6   0.75708
    557.57983   Fix_7   0.11329
    557.75348   Fix_8   0.65643
    558.43665   Fix_9   0.27447
    559.07281   Fix_10  0.01996
    559.14642   Fix_11  0
    555.97925   Sac_1   0.04577
    556.19916   Sac_2   0.07403
    556.52747   Sac_3   0.02789
    556.61865   Sac_4   0.02985
    556.76849   Sac_5   0.0337
    557.57294   Sac_6   0
    557.69965   Sac_7   0.04687
    558.41632   Sac_8   0.01325
    558.71796   Sac_9   0.34552
    559.09961   Sac_10  0.0404
    559.15302   Sac_11  0.00726

Any ideas on how to do that? Thanks for your help.

Upvotes: 0

Views: 578

Answers (2)

Chris
Chris

Reputation: 29742

One way using distutils.version:

import numpy as np
from distutils.version import LooseVersion

f = np.vectorize(LooseVersion)
new_df = df.sort_values("Event_ID", key=f)
print(new_df)

Output:

    Timestamp_Start Event_ID  Duration
0         555.54944    Fix_1   0.42248
3         556.03192    Fix_2   0.16113
4         556.27985    Fix_3   0.24188
5         556.56097    Fix_4   0.04987
6         556.65497    Fix_5   0.10748
7         556.80859    Fix_6   0.75708
8         557.57983    Fix_7   0.11329
9         557.75348    Fix_8   0.65643
10        558.43665    Fix_9   0.27447
1         559.07281   Fix_10   0.01996
2         559.14642   Fix_11   0.00000
11        555.97925    Sac_1   0.04577
14        556.19916    Sac_2   0.07403
15        556.52747    Sac_3   0.02789
16        556.61865    Sac_4   0.02985
17        556.76849    Sac_5   0.03370
18        557.57294    Sac_6   0.00000
19        557.69965    Sac_7   0.04687
20        558.41632    Sac_8   0.01325
21        558.71796    Sac_9   0.34552
12        559.09961   Sac_10   0.04040
13        559.15302   Sac_11   0.00726

Upvotes: 4

Sreeram TP
Sreeram TP

Reputation: 11927

Normal sorting on the dataframe will not work, as you need the integer in the string to be treated as int value.

It can be done with extra space though.

You can make two columns like this,

df['event'] = df.Event_ID.str.rsplit("_").str[0]
df['idx'] = df.Event_ID.str.rsplit("_").str[-1].astype(int)

Now, sort on these two columns,

df.sort_values(['event', 'idx'])


    Timestamp_Start Event_ID  Duration  idx event
0         555.54944    Fix_1   0.42248    1   Fix
3         556.03192    Fix_2   0.16113    2   Fix
4         556.27985    Fix_3   0.24188    3   Fix
5         556.56097    Fix_4   0.04987    4   Fix
6         556.65497    Fix_5   0.10748    5   Fix
7         556.80859    Fix_6   0.75708    6   Fix
8         557.57983    Fix_7   0.11329    7   Fix
9         557.75348    Fix_8   0.65643    8   Fix
10        558.43665    Fix_9   0.27447    9   Fix
1         559.07281   Fix_10   0.01996   10   Fix
2         559.14642   Fix_11   0.00000   11   Fix
11        555.97925    Sac_1   0.04577    1   Sac
14        556.19916    Sac_2   0.07403    2   Sac
15        556.52747    Sac_3   0.02789    3   Sac
16        556.61865    Sac_4   0.02985    4   Sac
17        556.76849    Sac_5   0.03370    5   Sac
18        557.57294    Sac_6   0.00000    6   Sac
19        557.69965    Sac_7   0.04687    7   Sac
20        558.41632    Sac_8   0.01325    8   Sac
21        558.71796    Sac_9   0.34552    9   Sac
12        559.09961   Sac_10   0.04040   10   Sac
13        559.15302   Sac_11   0.00726   11   Sac

You can reset_index, drop extra columns as needed

Upvotes: 1

Related Questions