Reputation: 848
I have AIS data for several trips from Rotterdam to Hamburg. The route is split into 6 sectors with pre-defined borders, and I need to know where and when a ship entered the next sector. I tried using the last record within a sector, but the data's resolution isn't high enough for that. So I would like to interpolate the time and latitude based on the longitude of the sector border.
You can see the borders I decided on for this trip in the following image. The longitude of crossing the border is always exactly on the border line. What I need to determine is the latitude at which this line is crossed by a ship.
My DataFrame looks like this:
TripID time Latitude Longitude SectorID
0 42 7 52.9 4.4 1
1 42 8 53.0 4.6 1
2 42 9 53.0 4.7 1
3 42 10 53.1 4.9 2
4 5 9 53.0 4.5 1
5 5 10 53.0 4.7 1
6 5 11 53.2 5.0 2
7 5 12 53.3 5.2 2
where the border between sectors 1 and 2 is pre-defined at longitude 4.8, so I would like to interpolate both the latitude and time at longitude 4.8 for each trip and sector border. I'm guessing a good solution will involve something like df.groupby(['TripID', 'SectorID']).
I tried adding an entry for each trip and sector that contains only the Longitude of the sector's border and then using interpolate, but adding the entries takes about an hour for me and interpolating the missing values crashes right away.
The result I'm looking for should look like this:
TripID time Latitude Longitude SectorID
0 42 7 52.9 4.4 1
1 42 8 53.0 4.6 1
2 42 9 53.0 4.7 1
8 42 9.5 53.05 4.8 1
3 42 10 53.1 4.9 2
4 5 9 53.0 4.5 1
5 5 10 53.0 4.7 1
9 5 10.3 53.06 4.8 1
6 5 11 53.2 5.0 2
7 5 12 53.3 5.2 2
I could also work with a result that looks like this:
TripID SectorID leave_lat leave_lon leave_time
42 1 53.05 4.8 9.5
5 1 53.06 4.8 10.3
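For reference, the values in the expected result are consistent with plain linear interpolation between the two records on either side of the border. A worked version for trip 42 follows; the symbols (λ for longitude, φ for latitude, t for time, index 0 for the last record before the border, 1 for the first record after it, b for the border) are notation chosen here only for illustration:

```latex
\begin{align}
f &= \frac{\lambda_b - \lambda_0}{\lambda_1 - \lambda_0}
   = \frac{4.8 - 4.7}{4.9 - 4.7} = 0.5 \\
t_b &= t_0 + f\,(t_1 - t_0) = 9 + 0.5 \cdot (10 - 9) = 9.5 \\
\varphi_b &= \varphi_0 + f\,(\varphi_1 - \varphi_0)
   = 53.0 + 0.5 \cdot (53.1 - 53.0) = 53.05
\end{align}
```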
Please ask if my description of the problem isn't clear.
Upvotes: 2
Views: 1058
Reputation: 848
I solved the issue in a different way. Because this solved the problem for me, but isn't the exact solution I asked for, I will accept Mr. T's answer. I am posting this for the sake of completeness anyway, so here is my solution:
Starting with the DataFrame df from my question:
TripID time Latitude Longitude SectorID
0 42 7 52.9 4.4 1
1 42 8 53.0 4.6 1
2 42 9 53.0 4.7 1
3 42 10 53.1 4.9 2
4 5 9 53.0 4.5 1
5 5 10 53.0 4.7 1
6 5 11 53.2 5.0 2
7 5 12 53.3 5.2 2
I used this code
df = df.sort_values('time')
# next record of the same trip
df['next_lat'] = df.groupby('TripID')['Latitude'].shift(-1)
df['next_lon'] = df.groupby('TripID')['Longitude'].shift(-1)
df['next_time'] = df.groupby('TripID')['time'].shift(-1)
df['next_sector_id'] = df.groupby('TripID')['SectorID'].shift(-1)
df = df.sort_values(['TripID', 'time'])
df['next_trip_id'] = df['TripID'].shift(-1)
# last record of each trip within each sector (copy to avoid chained-assignment warnings)
lasts = df[df['SectorID'] != df['next_sector_id']].copy()
# sector 1 is left at longitude 4.8 (SectorID is assumed to be numeric here)
lasts.loc[lasts['SectorID'] == 1, 'sector_leave_lon'] = 4.8
# fraction of the step from the last record to the next one at which the border is reached
frac = (lasts['sector_leave_lon'] - lasts['Longitude']) / (lasts['next_lon'] - lasts['Longitude'])
lasts['sector_leave_lat'] = lasts['Latitude'] + frac * (lasts['next_lat'] - lasts['Latitude'])
lasts['sector_leave_time'] = lasts['time'] + frac * (lasts['next_time'] - lasts['time'])
# copy the interpolated values back and spread them over all rows of the same trip and sector
df['sector_leave_lat'] = lasts['sector_leave_lat']
df['sector_leave_time'] = lasts['sector_leave_time']
df['sector_leave_lat'] = df.groupby(['TripID', 'SectorID'])['sector_leave_lat'].transform('last')
df['sector_leave_time'] = df.groupby(['TripID', 'SectorID'])['sector_leave_time'].transform('last')
df = df.drop(['next_lat', 'next_lon', 'next_time', 'next_sector_id', 'next_trip_id'], axis = 1)
which gives a result like this:
TripID time Latitude Longitude SectorID sector_leave_lat sector_leave_time
0 42 7 52.9 4.4 1 53.05 9.5
1 42 8 53.0 4.6 1 53.05 9.5
2 42 9 53.0 4.7 1 53.05 9.5
3 42 10 53.1 4.9 2 NaN NaN
4 5 9 53.0 4.5 1 53.06 10.3
5 5 10 53.0 4.7 1 53.06 10.3
6 5 11 53.2 5.0 2 NaN NaN
7 5 12 53.3 5.2 2 NaN NaN
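A more compact variant of the same idea can build the summary table from the question directly. This is only a sketch: the dictionary `leave_lon` (exit longitude per sector) is a hypothetical helper, only the 4.8 border is known from the question, and the output column names follow the alternative result format asked for there.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "TripID":    [42, 42, 42, 42, 5, 5, 5, 5],
    "time":      [7, 8, 9, 10, 9, 10, 11, 12],
    "Latitude":  [52.9, 53.0, 53.0, 53.1, 53.0, 53.0, 53.2, 53.3],
    "Longitude": [4.4, 4.6, 4.7, 4.9, 4.5, 4.7, 5.0, 5.2],
    "SectorID":  [1, 1, 1, 2, 1, 1, 2, 2],
})

# Assumption: longitude at which each sector is left (hypothetical mapping).
leave_lon = {1: 4.8}

df = df.sort_values(["TripID", "time"])
# Following record of the same trip, aligned on the original index.
nxt = df.groupby("TripID")[["time", "Latitude", "Longitude", "SectorID"]].shift(-1)

# Rows whose following record (same trip) lies in a different sector.
crossing = nxt["SectorID"].notna() & (nxt["SectorID"] != df["SectorID"])
last = df[crossing]
nxt = nxt[crossing]

border = last["SectorID"].map(leave_lon)
# Fraction of the step between the two records at which the border is reached.
frac = (border - last["Longitude"]) / (nxt["Longitude"] - last["Longitude"])

summary = pd.DataFrame({
    "TripID": last["TripID"],
    "SectorID": last["SectorID"],
    "leave_lat": last["Latitude"] + frac * (nxt["Latitude"] - last["Latitude"]),
    "leave_lon": border,
    "leave_time": last["time"] + frac * (nxt["time"] - last["time"]),
}).reset_index(drop=True)

print(summary.round(2))
```

Sectors without an entry in `leave_lon` would end up with NaN in the interpolated columns, which mirrors the NaN rows in the result above.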
I hope this helps those for whom the accepted solution doesn't cut it.
Upvotes: 0
Reputation: 12410
Since the usual pandas crew hasn't spotted this nice question, I'll give you a solution with some caveats. Here is the sample input I used:
TripID time Latitude Longitude
42 7 52.9 4.4
42 8 53.0 4.6
42 9 53.0 4.7 * missing value
42 10 53.1 4.9
42 11 53.2 4.9
42 12 53.3 5.3 * missing value
42 15 53.7 5.6
5 9 53.0 4.5
5 10 53.0 4.7 * missing value
5 11 53.2 5.0
5 12 53.4 5.2
5 14 53.6 5.3 * missing value
5 17 53.4 5.5
5 18 53.3 5.7
34 19 53.0 4.5
34 20 53.0 4.7
34 24 53.9 4.8 ** value already exists
34 25 53.8 4.9
34 27 53.8 5.3
34 28 53.8 5.3 * missing value
34 31 53.7 5.6
34 32 53.6 5.7
This code:
import numpy as np
import pandas as pd
#import data
df = pd.read_csv("test.txt", sep=r"\s+")
#set floating point output precision to prevent excessively long columns
pd.set_option("display.precision", 2)
#remember original column order
cols = df.columns
#define the sector borders
sectors = [4.8, 5.4]
#create all combinations of sector borders and TripIDs
dfborders = pd.DataFrame(index = pd.MultiIndex.from_product([df.TripID.unique(), sectors], names = ["TripID", "Longitude"])).reset_index()
#delete those combinations of TripID and Longitude that already exist in the original dataframe
dfborders = pd.merge(df, dfborders, on = ["TripID", "Longitude"], how = "right")
dfborders = dfborders[dfborders.isnull().any(axis = 1)]
#insert missing data points
df = pd.concat([df, dfborders])
#and sort dataframe to insert the missing data points in the right position
df = df[cols].groupby("TripID", sort = False).apply(pd.DataFrame.sort_values, ["Longitude", "time", "Latitude"])
#temporarily set longitude as index for value-based interpolation
df.set_index(["Longitude"], inplace = True, drop = False)
#interpolate group-wise
df = df.groupby("TripID", sort = False).apply(lambda g: g.interpolate(method = "index"))
#create sector ID column assuming that longitude is between -180 and +180
df["SectorID"] = np.digitize(df["Longitude"], bins = [-180] + sectors + [180])
#and reset index
df.reset_index(drop = True, inplace = True)
print(df)
produces the following output:
TripID time Latitude Longitude SectorID
0 42 7.00 52.90 4.4 1
1 42 8.00 53.00 4.6 1
2 42 9.00 53.00 4.7 1
3 42 9.50 53.05 4.8 2 * interpolated data point
4 42 10.00 53.10 4.9 2
5 42 11.00 53.20 4.9 2
6 42 12.00 53.30 5.3 2
7 42 13.00 53.43 5.4 3 * interpolated data point
8 42 15.00 53.70 5.6 3
9 5 9.00 53.00 4.5 1
10 5 10.00 53.00 4.7 1
11 5 10.33 53.07 4.8 2 * interpolated data point
12 5 11.00 53.20 5.0 2
13 5 12.00 53.40 5.2 2
14 5 14.00 53.60 5.3 2
15 5 15.50 53.50 5.4 3 * interpolated data point
16 5 17.00 53.40 5.5 3
17 5 18.00 53.30 5.7 3
18 34 19.00 53.00 4.5 1
19 34 20.00 53.00 4.7 1
20 34 24.00 53.90 4.8 2
21 34 25.00 53.80 4.9 2
22 34 27.00 53.80 5.3 2
23 34 28.00 53.80 5.3 2
24 34 29.00 53.77 5.4 3 * interpolated data point
25 34 31.00 53.70 5.6 3
26 34 32.00 53.60 5.7 3
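One detail of the output above is worth spelling out: with its default settings, np.digitize puts a point that lies exactly on a border into the sector being entered, which is why the interpolated rows carry the higher SectorID. A small sketch of the difference, using the same border values; passing `right=True` is an alternative shown here for illustration, not part of the code above:

```python
import numpy as np

sectors = [4.8, 5.4]                 # border longitudes used in the answer
bins = [-180] + sectors + [180]
lons = np.array([4.7, 4.8, 4.9, 5.4, 5.6])

# Default (right=False): bins[i-1] <= x < bins[i],
# so a point on a border counts towards the sector being entered.
ids_entering = np.digitize(lons, bins=bins)

# right=True: bins[i-1] < x <= bins[i],
# so a point on a border still counts towards the sector being left.
ids_leaving = np.digitize(lons, bins=bins, right=True)

print(ids_entering)  # [1 2 2 3 3]
print(ids_leaving)   # [1 1 2 2 3]
```

The second form matches the expected result in the question, where the interpolated border row keeps the SectorID of the sector being left.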
Now to the caveats. I do not know how to add the missing rows in place. I will ask a separate question about how to do this and update my answer here if I get a reply. Until then, the side effect is that the table is sorted by Longitude within each TripID, and the assumption is that Longitude does not decrease, which in reality might not be the case.
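If Longitude can decrease, one way around the sorting assumption might be to keep the records in time order and look for consecutive pairs that lie on opposite sides of a border. The following is a minimal sketch of that idea, not a drop-in replacement for the code above: the function name `border_crossings`, the output column names and the demo data are all made up for illustration.

```python
import pandas as pd

def border_crossings(df, borders):
    """Interpolate time and latitude wherever a trip crosses a border longitude."""
    df = df.sort_values(["TripID", "time"])
    # Following record of the same trip; NaN for the last record of each trip.
    nxt = df.groupby("TripID")[["time", "Latitude", "Longitude"]].shift(-1)
    parts = []
    for b in borders:
        d0 = df["Longitude"] - b       # signed distance of the current record
        d1 = nxt["Longitude"] - b      # signed distance of the following record
        # Crossing: the two records lie on opposite sides of the border,
        # or the following record sits exactly on it. NaN compares as False.
        hit = (d0 * d1 < 0) | ((d1 == 0) & (d0 != 0))
        frac = d0[hit] / (d0[hit] - d1[hit])   # (b - lon) / (next_lon - lon)
        parts.append(pd.DataFrame({
            "TripID": df.loc[hit, "TripID"],
            "border_lon": b,
            "cross_time": df.loc[hit, "time"]
                          + frac * (nxt.loc[hit, "time"] - df.loc[hit, "time"]),
            "cross_lat": df.loc[hit, "Latitude"]
                         + frac * (nxt.loc[hit, "Latitude"] - df.loc[hit, "Latitude"]),
        }))
    out = pd.concat(parts)
    return out.sort_values(["TripID", "cross_time"]).reset_index(drop=True)

# Made-up trip that crosses longitude 4.8 eastwards, westwards and eastwards again.
demo = pd.DataFrame({
    "TripID":    [1, 1, 1, 1],
    "time":      [0, 1, 2, 3],
    "Latitude":  [53.0, 53.2, 53.2, 53.4],
    "Longitude": [4.6, 5.0, 4.7, 4.9],
})
result = border_crossings(demo, borders=[4.8])
print(result.round(2))
```

Because nothing is sorted by Longitude, every crossing is reported, including those in the opposite direction. A record that merely touches a border and turns back would also be reported once, so that case may need extra handling.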
Upvotes: 1