Jan Kaiser

Reputation: 848

Separate Pandas DataFrame into sections between rows that satisfy a condition

I have a DataFrame of several trips that looks kind of like this:

   TripID  Lat    Lon    time  delta_t
0  1       53.55  9.99   74    1
1  1       53.58  9.99   75    1
2  1       53.60  9.98   76    5
3  1       53.60  9.98   81    1
4  1       53.58  9.99   82    1
5  1       53.59  9.97   83    NaN
6  2       52.01  10.04  64    1
7  2       52.34  10.05  65    1
8  2       52.33  10.07  66    NaN

As you can see, I have records of location and time, which all belong to some trip, identified by a trip ID. I have also computed delta_t as the time that passes until the entry that follows in the trip. The last entry of each trip is assigned NaN as its delta_t.

Now I need to make sure that the time step of my records is the same value across all my data. I've gone with one time unit for this example. For the most part the trips do fulfill this condition, but every now and then I have a single record, such as record no. 2, within an otherwise fine trip, that doesn't.

That's why I want to simply split my trip into two trips at this point. That got me stuck, though: I can't seem to find a good way of doing this.

To consider each trip by itself, I was thinking of something like this:

 for key, grp in df.groupby('TripID'):
     # split trip at too long delta_t(s)
     pass  # placeholder: this is the part I don't know how to write

However, the actual splitting within the loop is what I don't know how to do. Basically, I need to assign a new trip ID to every entry from one large delta_t to the next (or the end of the trip), or have some sort of grouping operation that can group between those large delta_t.

I know this is quite a specific problem. I hope someone has an idea how to do this.

I think the new NaNs, which would then be needed, can be neglected at first and easily added later with this line (which I know only works for ascending trip IDs):

 df.loc[df['TripID'].diff().shift(-1) > 0, 'delta_t'] = np.nan

Upvotes: 2

Views: 253

Answers (1)

sacuL

Reputation: 51335

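IIUC, there is no need for a loop. The following creates a new column called new_TripID that starts a new ID whenever either of two conditions holds: the original TripID changes from one row to the next, or the difference in your time column is greater than one. The first row always counts as a change (it is compared against NaN), so the IDs start at 1.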

df['new_TripID'] = ((df['TripID'] != df['TripID'].shift()) | (df.time.diff() > 1)).cumsum()

>>> df
   TripID    Lat    Lon  time  delta_t  new_TripID
0       1  53.55   9.99    74      1.0           1
1       1  53.58   9.99    75      1.0           1
2       1  53.60   9.98    76      5.0           1
3       1  53.60   9.98    81      1.0           2
4       1  53.58   9.99    82      1.0           2
5       1  53.59   9.97    83      NaN           2
6       2  52.01  10.04    64      1.0           3
7       2  52.34  10.05    65      1.0           3
8       2  52.33  10.07    66      NaN           3
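
If you would rather key the split on your existing delta_t column than on time.diff(), a minimal sketch could look like the following. It assumes the sample data from the question (Lat and Lon left out for brevity) and that delta_t on a row describes the gap to the next row of the same trip, so the flag has to be shifted down by one row within each trip. The names new_trip and gap are just illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question (Lat/Lon omitted, they play no role here)
df = pd.DataFrame({
    "TripID": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "time": [74, 75, 76, 81, 82, 83, 64, 65, 66],
    "delta_t": [1, 1, 5, 1, 1, np.nan, 1, 1, np.nan],
})

# True on the first row of every original trip (row 0 compares against NaN)
new_trip = df["TripID"] != df["TripID"].shift()

# delta_t belongs to the row *before* the gap, so shift it down one row
# within each trip; NaN > 1 evaluates to False, so trip ends are ignored
gap = df.groupby("TripID")["delta_t"].shift() > 1

# Every True starts a new segment; the running sum becomes the new ID
df["new_TripID"] = (new_trip | gap).cumsum()
print(df)
```

On the sample data this should give the same new_TripID column as the one-liner above.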

Note that from your description and your data, it looks like you could really use groupby, and you should probably look into it for other manipulations. However, in the particular case you're asking about, it's unnecessary.
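
As one example of where groupby does help: once the trips are split, delta_t can be recomputed per new trip, which also puts the NaN on the last row of each new trip without relying on ascending IDs. This is only a sketch; it assumes a new_TripID column already exists with the values shown in the output above, and it rebuilds a small frame so that it runs on its own.

```python
import pandas as pd

# Assumed state after the split: new_TripID as in the output above
df = pd.DataFrame({
    "new_TripID": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "time": [74, 75, 76, 81, 82, 83, 64, 65, 66],
})

# Time of the following record within the same new trip;
# the last record of each trip has no successor and gets NaN
next_time = df.groupby("new_TripID")["time"].shift(-1)

# Time that passes until the next record of the same trip
df["delta_t"] = next_time - df["time"]
print(df)
```

With the sample data, every remaining delta_t should be 1, and rows 2, 5 and 8 should hold NaN.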

Upvotes: 3
