Jan Kaiser

Reputation: 848

Interpolate specific entries in DataFrame depending on groups

I have AIS data for several trips from Rotterdam to Hamburg. The route is split into 6 sectors with pre-defined borders, and I need to know where and when a ship entered the next sector. I tried just using the last record within a sector, but the data's resolution isn't high enough for that. So I would like to interpolate the time and latitude at the longitude of each sector border.

You can see the borders I decided on for this trip in the following image. The longitude of crossing the border is always exactly on the border line. What I need to determine is the latitude at which this line is crossed by a ship.

Map of my borders

My DataFrame looks like this:

       TripID  time  Latitude Longitude  SectorID
0      42       7    52.9     4.4        1
1      42       8    53.0     4.6        1
2      42       9    53.0     4.7        1
3      42      10    53.1     4.9        2
4       5       9    53.0     4.5        1
5       5      10    53.0     4.7        1
6       5      11    53.2     5.0        2
7       5      12    53.3     5.2        2

where the border between sectors 1 and 2 is pre-defined at longitude 4.8, so I would like to interpolate both the latitude and time at longitude 4.8 for each trip and sector border. I'm guessing a good solution will involve something like df.groupby(['TripID', 'SectorID']).

I tried adding an entry for each trip and sector that contains only the Longitude of the sector's border and then using interpolate to fill in the rest, but adding the entries takes about an hour for me and interpolating the missing values crashes right away.

The result I'm looking for should look like this:

       TripID  time  Latitude Longitude  SectorID
0      42       7    52.9     4.4        1
1      42       8    53.0     4.6        1
2      42       9    53.0     4.7        1
8      42     9.5   53.05     4.8        1
3      42      10    53.1     4.9        2
4       5       9    53.0     4.5        1
5       5      10    53.0     4.7        1
9       5    10.3   53.06     4.8        1
6       5      11    53.2     5.0        2
7       5      12    53.3     5.2        2

I would also be happy about and able to work with a result that looks like this:

 TripID  SectorID  leave_lat  leave_lon  leave_time
 42      1         53.05      4.8        9.5
 5       1         53.06      4.8        10.3
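
For reference, the numbers in these tables are meant to come from plain linear interpolation between the last record before the border and the first record after it. A minimal sketch of that arithmetic follows; the helper name interpolate_at_border and the (time, lat, lon) tuple layout are only illustrative assumptions, not part of my actual data pipeline.

```python
def interpolate_at_border(before, after, border_lon):
    """Linearly interpolate (time, latitude) at border_lon.

    before and after are (time, lat, lon) tuples of the two records
    that lie on either side of the border (hypothetical layout).
    """
    t0, lat0, lon0 = before
    t1, lat1, lon1 = after
    # Fraction of the segment that has been covered when the border is reached
    frac = (border_lon - lon0) / (lon1 - lon0)
    return t0 + frac * (t1 - t0), lat0 + frac * (lat1 - lat0)

# Trip 42: the records at time 9 and 10 lie on either side of longitude 4.8
time_42, lat_42 = interpolate_at_border((9, 53.0, 4.7), (10, 53.1, 4.9), 4.8)
# Trip 5: the records at time 10 and 11 lie on either side of longitude 4.8
time_5, lat_5 = interpolate_at_border((10, 53.0, 4.7), (11, 53.2, 5.0), 4.8)

print(round(time_42, 2), round(lat_42, 2))
print(round(time_5, 2), round(lat_5, 2))
```

For trip 42 this gives roughly time 9.5 and latitude 53.05. For trip 5 it gives about 10.33 and 53.07, which the tables above show in rounded form.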

Please ask if my description of the problem isn't quite clear.

Upvotes: 2

Views: 1058

Answers (2)

Jan Kaiser

Reputation: 848

I solved the issue in a different way. Because this solved the problem for me, but isn't the exact solution I asked for, I will accept Mr. T's answer. I am posting this for the sake of completeness anyway, so here is my solution:

Starting with the DataFrame df from my question

        TripID  time  Latitude Longitude  SectorID
0      42       7    52.9     4.4        1
1      42       8    53.0     4.6        1
2      42       9    53.0     4.7        1
3      42      10    53.1     4.9        2
4       5       9    53.0     4.5        1
5       5      10    53.0     4.7        1
6       5      11    53.2     5.0        2
7       5      12    53.3     5.2        2

I used this code

# Sort by time so that shift(-1) within a trip returns the next record
df = df.sort_values('time')

# Attach the following record's values to each row, separately for each trip
df['next_lat'] = df.groupby('TripID')['Latitude'].shift(-1)
df['next_lon'] = df.groupby('TripID')['Longitude'].shift(-1)
df['next_time'] = df.groupby('TripID')['time'].shift(-1)
df['next_sector_id'] = df.groupby('TripID')['SectorID'].shift(-1)
df = df.sort_values(['TripID', 'time'])

# Last record of each trip within each sector (copy to avoid SettingWithCopyWarning)
lasts = df[df['SectorID'] != df['next_sector_id']].copy()

# Longitude of the border at which sector 1 is left (SectorID is numeric in the sample data)
lasts.loc[lasts['SectorID'] == 1, 'sector_leave_lon'] = 4.8

# Fraction of the way from this record to the next at which the border is crossed
frac = (lasts['sector_leave_lon'] - lasts['Longitude']) / (lasts['next_lon'] - lasts['Longitude'])

# Linear interpolation of latitude and time at the border
lasts['sector_leave_lat'] = lasts['Latitude'] + frac * (lasts['next_lat'] - lasts['Latitude'])
lasts['sector_leave_time'] = lasts['time'] + frac * (lasts['next_time'] - lasts['time'])

# Copy the results back (aligned on the index)
df['sector_leave_lat'] = lasts['sector_leave_lat']
df['sector_leave_time'] = lasts['sector_leave_time']

# Broadcast the values to every row of the same trip and sector
df['sector_leave_lat'] = df.groupby(['TripID', 'SectorID'])['sector_leave_lat'].transform('last')
df['sector_leave_time'] = df.groupby(['TripID', 'SectorID'])['sector_leave_time'].transform('last')

df = df.drop(['next_lat', 'next_lon', 'next_time', 'next_sector_id'], axis=1)

which gives a result like this

        TripID  time  Latitude Longitude  SectorID  sector_leave_lat  sector_leave_time
0      42       7    52.9     4.4        1          53.05              9.5
1      42       8    53.0     4.6        1          53.05              9.5
2      42       9    53.0     4.7        1          53.05              9.5
3      42      10    53.1     4.9        2          NaN               NaN
4       5       9    53.0     4.5        1          53.06             10.3
5       5      10    53.0     4.7        1          53.06             10.3
6       5      11    53.2     5.0        2          NaN               NaN
7       5      12    53.3     5.2        2          NaN               NaN
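
If the compact table from the question is preferred, or if there is more than one border, the same idea can probably be written with a mapping from SectorID to border longitude. This is only a sketch under assumptions: the leave_lon dictionary is something I made up for illustration, and the 5.4 entry is a placeholder rather than one of my real borders.

```python
import pandas as pd

df = pd.DataFrame({
    "TripID":    [42, 42, 42, 42, 5, 5, 5, 5],
    "time":      [7, 8, 9, 10, 9, 10, 11, 12],
    "Latitude":  [52.9, 53.0, 53.0, 53.1, 53.0, 53.0, 53.2, 53.3],
    "Longitude": [4.4, 4.6, 4.7, 4.9, 4.5, 4.7, 5.0, 5.2],
    "SectorID":  [1, 1, 1, 2, 1, 1, 2, 2],
})

# Assumed mapping: SectorID -> longitude of the border at which that sector is left.
# 4.8 comes from the question; 5.4 is a placeholder for the next border.
leave_lon = {1: 4.8, 2: 5.4}

df = df.sort_values(["TripID", "time"])
# Values of the following record within the same trip (NaN for a trip's last record)
nxt = df.groupby("TripID")[["time", "Latitude", "Longitude", "SectorID"]].shift(-1)

# Rows whose successor within the same trip lies in a different sector
crossing = nxt["SectorID"].notna() & (nxt["SectorID"] != df["SectorID"])
cur, nxt = df[crossing], nxt[crossing]

border = cur["SectorID"].map(leave_lon)
frac = (border - cur["Longitude"]) / (nxt["Longitude"] - cur["Longitude"])

summary = pd.DataFrame({
    "TripID": cur["TripID"],
    "SectorID": cur["SectorID"],
    "leave_lat": cur["Latitude"] + frac * (nxt["Latitude"] - cur["Latitude"]),
    "leave_lon": border,
    "leave_time": cur["time"] + frac * (nxt["time"] - cur["time"]),
}).reset_index(drop=True)
print(summary.round(2))
```

This yields one row per trip and sector that is actually left, so sectors in which a trip ends do not appear at all.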

I hope this helps those for whom the accepted solution doesn't cut it.

Upvotes: 0

Mr. T

Reputation: 12410

Since the usual pandas crew hasn't spotted this nice question, I'll give you a solution with some caveats. Here is the sample input I used:

TripID  time  Latitude Longitude  
42       7    52.9     4.4        
42       8    53.0     4.6        
42       9    53.0     4.7 * missing value
42      10    53.1     4.9 
42      11    53.2     4.9         
42      12    53.3     5.3 * missing value
42      15    53.7     5.6    
5        9    53.0     4.5        
5       10    53.0     4.7  * missing value
5       11    53.2     5.0       
5       12    53.4     5.2        
5       14    53.6     5.3  * missing value
5       17    53.4     5.5        
5       18    53.3     5.7  
34      19    53.0     4.5  
34      20    53.0     4.7          
34      24    53.9     4.8  ** value already exists
34      25    53.8     4.9        
34      27    53.8     5.3        
34      28    53.8     5.3  * missing value
34      31    53.7     5.6        
34      32    53.6     5.7 

This code:

import numpy as np
import pandas as pd

#import data
df = pd.read_csv("test.txt", sep=r"\s+")

#set floating point output precision to prevent excessively long columns
pd.set_option("display.precision", 2)
#remember original column order
cols = df.columns
#define the sector borders
sectors = [4.8, 5.4]

#create all combinations of sector borders and TripIDs
dfborders = pd.DataFrame(index = pd.MultiIndex.from_product([df.TripID.unique(), sectors], names = ["TripID", "Longitude"])).reset_index()
#delete those combinations of TripID and Longitude that already exist in the original dataframe
dfborders = pd.merge(df, dfborders, on = ["TripID", "Longitude"], how = "right")
dfborders = dfborders[dfborders.isnull().any(axis = 1)]
#insert missing data points
df = pd.concat([df, dfborders])
#and sort dataframe to insert the missing data points in the right position
df = df[cols].groupby("TripID", sort = False).apply(pd.DataFrame.sort_values, ["Longitude", "time", "Latitude"])

#temporarily set longitude as index for value-based interpolation
df.set_index(["Longitude"], inplace = True, drop = False)
#interpolate group-wise
df = df.groupby("TripID", sort = False).apply(lambda g: g.interpolate(method = "index"))
#create sector ID column assuming that longitude is between -180 and +180
df["SectorID"] = np.digitize(df["Longitude"], bins = [-180] + sectors + [180])
#and reset index
df.reset_index(drop = True, inplace = True)
print(df)

produces the following output:

    TripID   time  Latitude  Longitude  SectorID
0       42   7.00     52.90        4.4         1
1       42   8.00     53.00        4.6         1
2       42   9.00     53.00        4.7         1
3       42   9.50     53.05        4.8         2 * interpolated data point
4       42  10.00     53.10        4.9         2
5       42  11.00     53.20        4.9         2
6       42  12.00     53.30        5.3         2
7       42  13.00     53.43        5.4         3 * interpolated data point
8       42  15.00     53.70        5.6         3
9        5   9.00     53.00        4.5         1
10       5  10.00     53.00        4.7         1
11       5  10.33     53.07        4.8         2 * interpolated data point
12       5  11.00     53.20        5.0         2
13       5  12.00     53.40        5.2         2
14       5  14.00     53.60        5.3         2
15       5  15.50     53.50        5.4         3 * interpolated data point
16       5  17.00     53.40        5.5         3
17       5  18.00     53.30        5.7         3
18      34  19.00     53.00        4.5         1
19      34  20.00     53.00        4.7         1
20      34  24.00     53.90        4.8         2
21      34  25.00     53.80        4.9         2
22      34  27.00     53.80        5.3         2
23      34  28.00     53.80        5.3         2
24      34  29.00     53.77        5.4         3 * interpolated data point
25      34  31.00     53.70        5.6         3
26      34  32.00     53.60        5.7         3
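
One detail worth noting: with its default settings, np.digitize puts a point that lies exactly on a border into the sector being entered (SectorID 2 for longitude 4.8), whereas the desired output in the question lists the border point under the sector being left. If that matters, passing right=True should flip the assignment. A small sketch with a few hand-picked longitudes (chosen only for illustration):

```python
import numpy as np

sectors = [4.8, 5.4]
bins = [-180] + sectors + [180]
lons = np.array([4.7, 4.8, 4.9, 5.4, 5.6])

# Default (right=False): bins[i-1] <= x < bins[i], so a border point
# belongs to the sector that is being entered
default_ids = np.digitize(lons, bins=bins)

# right=True: bins[i-1] < x <= bins[i], so a border point
# belongs to the sector that is being left
right_ids = np.digitize(lons, bins=bins, right=True)

print(default_ids)
print(right_ids)
```

Only the points at 4.8 and 5.4 change their sector between the two variants; all other points are assigned identically.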

Now to the caveats. I do not know how to insert the missing rows in place. I will ask a separate question about how to do this, and if I get an answer, I will update mine here. Until then, the side effect is that the table is sorted by Longitude within each TripID. This assumes that Longitude never decreases along a trip, which in reality might not be the case.
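
One possible way around the Longitude-sorting assumption, sketched here and not tested on real AIS data, is to look at consecutive records in time order and interpolate wherever two neighbours lie on opposite sides of a border. The function name add_border_rows and the demo trip are hypothetical, and the sketch expects exactly the columns TripID, time, Latitude and Longitude.

```python
import numpy as np
import pandas as pd

def add_border_rows(df, borders):
    """Insert interpolated rows where consecutive records of a trip straddle a border."""
    df = df.sort_values(["TripID", "time"]).reset_index(drop=True)
    # Following record within the same trip (NaN for the last record of a trip)
    nxt = df.groupby("TripID")[["time", "Latitude", "Longitude"]].shift(-1)
    lo = np.minimum(df["Longitude"], nxt["Longitude"])
    hi = np.maximum(df["Longitude"], nxt["Longitude"])
    pieces = [df]
    for b in borders:
        # Strictly between, so records already on the border are not duplicated
        mask = (lo < b) & (b < hi)
        if not mask.any():
            continue
        cur, nx = df[mask], nxt[mask]
        frac = (b - cur["Longitude"]) / (nx["Longitude"] - cur["Longitude"])
        pieces.append(pd.DataFrame({
            "TripID": cur["TripID"],
            "time": cur["time"] + frac * (nx["time"] - cur["time"]),
            "Latitude": cur["Latitude"] + frac * (nx["Latitude"] - cur["Latitude"]),
            "Longitude": b,
        }))
    out = pd.concat(pieces, ignore_index=True)
    # Sorting by time keeps the original travel order, even if Longitude decreases
    return out.sort_values(["TripID", "time"]).reset_index(drop=True)

# Hypothetical trip that sails east across the border at 4.8 and then back west
demo = pd.DataFrame({
    "TripID":    [1, 1, 1],
    "time":      [0, 2, 4],
    "Latitude":  [53.0, 53.2, 53.4],
    "Longitude": [4.6, 5.0, 4.6],
})
result = add_border_rows(demo, [4.8])
print(result)
```

In this demo the border is crossed twice, so two rows are inserted, one for each direction. The trade-off is that the interpolation is done by hand instead of with interpolate(method = "index").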

Upvotes: 1
