Reputation: 479
I have a pandas DataFrame that represents the GPS trajectory of a vehicle:
import pandas as pd
d1 = {'id': [1, 2, 3, 4, 5, 6, 7, 8, 9], 'longitude': [4.929783, 4.932333, 4.933950, 4.933900, 4.928467, 4.924583, 4.922133, 4.921400, 4.920967], 'latitude': [52.372250, 52.370884, 52.371101, 52.372234, 52.375282, 52.375950, 52.376301, 52.376232, 52.374481]}
df1 = pd.DataFrame(data=d1)
id longitude latitude
1 4.929783 52.372250
2 4.932333 52.370884
3 4.933950 52.371101
4 4.933900 52.372234
5 4.928467 52.375282
6 4.924583 52.375950
7 4.922133 52.376301
8 4.921400 52.376232
9 4.920967 52.374481
I already calculated the (haversine) distance in meters between consecutive GPS points as follows:
import numpy as np
def haversine(lat1, lon1, lat2, lon2, earth_radius=6371):
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2-lat1)/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
    km = earth_radius * 2 * np.arcsin(np.sqrt(a))
    m = km * 1000
    return m
df1['distance'] = haversine(df1['latitude'], df1['longitude'],
                            df1['latitude'].shift(), df1['longitude'].shift())
id longitude latitude distance
1 4.929783 52.372250 NaN
2 4.932333 52.370884 230.305288
3 4.933950 52.371101 112.398101
4 4.933900 52.372234 126.029572
5 4.928467 52.375282 500.896578
6 4.924583 52.375950 273.918990
7 4.922133 52.376301 170.828592
8 4.921400 52.376232 50.345227
9 4.920967 52.374481 196.908503
Now I would like to create a function that:
removes the later point of each pair whenever the distance between two consecutive GPS points is less than 150 meters;
always keeps the first and the last GPS point, regardless of the distance to the previously kept point.
Meaning this should be the output:
id longitude latitude distance
1 4.929783 52.372250 NaN
2 4.932333 52.370884 230.305288
5 4.928467 52.375282 500.896578
6 4.924583 52.375950 273.918990
7 4.922133 52.376301 170.828592
9 4.920967 52.374481 196.908503
What is the best way to achieve this in Python?
Upvotes: 2
Views: 350
Reputation: 294258
NOTE: This doesn't account for maximum distance... that would require some look ahead and optimization.
I would iterate through the rows and pass back just the index values of the rows you'd like to keep. Use those index values in a loc call.
Use whatever metric you want. I used OP's haversine distance.
import numpy as np

def haversine(lat1, lon1, lat2, lon2, earth_radius=6371):
    # earth_radius is in kilometers; the result is returned in meters
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2-lat1)/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
    km = earth_radius * 2 * np.arcsin(np.sqrt(a))
    m = km * 1000
    return m

def dis(t0, t1):
    # distance in meters between two rows produced by itertuples()
    lat0 = t0.latitude
    lon0 = t0.longitude
    lat1 = t1.latitude
    lon1 = t1.longitude
    return haversine(lat0, lon0, lat1, lon1)

def f(d, threshold=50):
    itups = d.itertuples()
    last = next(itups)          # the first row is always kept
    indices = [last.Index]
    distances = [0]
    for tup in itups:
        # measure against the last *kept* row, not the previous row
        distance = dis(tup, last)
        if distance > threshold:
            indices.append(tup.Index)
            distances.append(distance)
            last = tup
    return indices, distances
idx, distances = f(df1, 150)
df1.loc[idx].assign(distance=distances)
id longitude latitude distance
0 1 4.929783 52.372250 0.000000
1 2 4.932333 52.370884 230.305288
3 4 4.933900 52.372234 183.986479
4 5 4.928467 52.375282 500.896578
5 6 4.924583 52.375950 273.918990
6 7 4.922133 52.376301 170.828592
8 9 4.920967 52.374481 217.302775
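Note that f keeps the final row here only because it happens to be more than 150 m from the last kept point. If the last point must always survive, as the question asks, one possible variant is sketched below. It is only a sketch under a few assumptions: it uses plain lists and the standard math module so that it runs on its own, it measures each point against the last kept point (same as f), and the names haversine_m and thin_keep_ends are made up for illustration.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, earth_radius_km=6371):
    """Great-circle distance in meters between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return earth_radius_km * 2 * math.asin(math.sqrt(a)) * 1000

def thin_keep_ends(lats, lons, threshold=150):
    """Return (positions to keep, distance of each kept point to the
    previously kept point in meters). The first and last points are
    always kept, whatever their distance."""
    n = len(lats)
    if n == 0:
        return [], []
    keep, distances = [0], [0.0]
    last = 0  # position of the most recently kept point
    for i in range(1, n):
        d = haversine_m(lats[last], lons[last], lats[i], lons[i])
        # keep the point if it is far enough away, or if it is the final one
        if d > threshold or i == n - 1:
            keep.append(i)
            distances.append(d)
            last = i
    return keep, distances

# Coordinates from the question
lons = [4.929783, 4.932333, 4.933950, 4.933900, 4.928467,
        4.924583, 4.922133, 4.921400, 4.920967]
lats = [52.372250, 52.370884, 52.371101, 52.372234, 52.375282,
        52.375950, 52.376301, 52.376232, 52.374481]

keep, distances = thin_keep_ends(lats, lons, threshold=150)
print(keep)
```

With the DataFrame from the question you could pass df1['latitude'].tolist() and df1['longitude'].tolist(), then select the rows by position with df1.iloc[keep].assign(distance=distances).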
Upvotes: 1