Reputation: 848
I have a DataFrame of several trips that looks kind of like this:
TripID Lat Lon time delta_t
0 1 53.55 9.99 74 1
1 1 53.58 9.99 75 1
2 1 53.60 9.98 76 5
3 1 53.60 9.98 81 1
4 1 53.58 9.99 82 1
5 1 53.59 9.97 83 NaN
6 2 52.01 10.04 64 1
7 2 52.34 10.05 65 1
8 2 52.33 10.07 66 NaN
As you can see, I have records of location and time, which all belong to some trip, identified by a trip ID. I have also computed delta_t
as the time that passes until the entry that follows in the trip. The last entry of each trip is assigned NaN
as its delta_t
.
Now I need to make sure that the time step of my records is the same value across all my data. I've gone with one time unit for this example. For the most part the trips do fulfill this condition, but every now and then I have a single record, such as record no. 2, within an otherwise fine trip, that doesn't.
That's why I want to simply split my trip into two trips at this point. That go me stuck though. I can't seem to find a good way of doing this.
To consider each trip by itself, I was thinking of something like this:
for key, grp in df.groupby('TripID'):
# split trip at too long delta_t(s)
However, the actual splitting within the loop is what I don't know how to do. Basically, I need to assign a new trip ID to every entry from one large delta_t
to the next (or the end of the trip), or have some sort of grouping operation that can group between those large delta_t
.
I know this is quite a specific problem. I hope someone has an idea how to do this.
I think the new NaN
s, which would then be needed, can be neglected at first and easily added later with this line (which I know only works for ascending trip IDs):
df.loc[df['TripID'].diff().shift(-1) > 0, 'delta_t'] = np.nan
Upvotes: 2
Views: 253
Reputation: 51335
IIUC, there is no need for a loop. The following creates a new column called new_TripID
based on 2 conditions: That the original TripID
changes from one row to the next, or that the difference in your time
column is greater than one
df['new_TripID'] = ((df['TripID'] != df['TripID'].shift()) | (df.time.diff() > 1)).cumsum()
>>> df
TripID Lat Lon time delta_t new_TripID
0 1 53.55 9.99 74 1.0 1
1 1 53.58 9.99 75 1.0 1
2 1 53.60 9.98 76 5.0 1
3 1 53.60 9.98 81 1.0 2
4 1 53.58 9.99 82 1.0 2
5 1 53.59 9.97 83 NaN 2
6 2 52.01 10.04 64 1.0 3
7 2 52.34 10.05 65 1.0 3
8 2 52.33 10.07 66 NaN 3
Note that from your description and your data, it looks like you could really use groupby
, and you should probably look into it for other manipulations. However, in the particular case you're asking for, it's unnecessary
Upvotes: 3