sampeterson

Reputation: 479

Calculate distance between consecutive GPS points and reduce GPS density based on this distance

I have a pandas DataFrame that represents the GPS trajectory of a vehicle:

import pandas as pd

d1 = {'id': [1, 2, 3, 4, 5, 6, 7, 8, 9], 'longitude': [4.929783, 4.932333, 4.933950, 4.933900, 4.928467, 4.924583, 4.922133, 4.921400, 4.920967], 'latitude': [52.372250, 52.370884, 52.371101, 52.372234, 52.375282, 52.375950, 52.376301, 52.376232, 52.374481]}
df1 = pd.DataFrame(data=d1)

id   longitude   latitude     
1    4.929783    52.372250    
2    4.932333    52.370884    
3    4.933950    52.371101    
4    4.933900    52.372234    
5    4.928467    52.375282    
6    4.924583    52.375950    
7    4.922133    52.376301    
8    4.921400    52.376232    
9    4.920967    52.374481    

I already calculated the (haversine) distance in meters between consecutive GPS points as follows:

import numpy as np
def haversine(lat1, lon1, lat2, lon2, earth_radius=6371):
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])

    a = np.sin((lat2-lat1)/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
    km = earth_radius * 2 * np.arcsin(np.sqrt(a))
    m = km * 1000
    return m

df1['distance'] = haversine(df1['latitude'], df1['longitude'],
                       df1['latitude'].shift(), df1['longitude'].shift())

id  longitude   latitude    distance
1   4.929783    52.372250   NaN
2   4.932333    52.370884   230.305288
3   4.933950    52.371101   112.398101
4   4.933900    52.372234   126.029572
5   4.928467    52.375282   500.896578
6   4.924583    52.375950   273.918990
7   4.922133    52.376301   170.828592
8   4.921400    52.376232   50.345227
9   4.920967    52.374481   196.908503

Now I would like to create a function that:

  1. removes the second point of a consecutive pair (i.e. the following point) if the distance between the two GPS points is less than 150 meters;

  2. always keeps the first and the last GPS point, regardless of the distance to the previously kept point.

Meaning this should be the output:

id  longitude   latitude    distance
1   4.929783    52.372250   NaN
2   4.932333    52.370884   230.305288
5   4.928467    52.375282   500.896578
6   4.924583    52.375950   273.918990
7   4.922133    52.376301   170.828592
9   4.920967    52.374481   196.908503

What is the best way to achieve this in Python?

Upvotes: 2

Views: 350

Answers (1)

piRSquared

Reputation: 294258

NOTE: This doesn't account for maximum distance... that would require some look ahead and optimization.


I would iterate through and pass back just the index values of the rows you'd like to keep. Use those index values in a loc call.

Distance

Use whatever metric you want. I used OP's haversine distance.

import numpy as np

def haversine(lat1, lon1, lat2, lon2, earth_radius=6371):
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])

    a = np.sin((lat2-lat1)/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
    km = earth_radius * 2 * np.arcsin(np.sqrt(a))
    m = km * 1000
    return m

def dis(t0, t1):
    lat0 = t0.latitude
    lon0 = t0.longitude
    lat1 = t1.latitude
    lon1 = t1.longitude
    return haversine(lat0, lon0, lat1, lon1)
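
As an illustration of swapping in a different metric, here is a minimal sketch using the equirectangular approximation, which is cheaper than haversine and very close to it over distances of a few hundred meters. The names `equirectangular` and `dis_approx` are hypothetical and not part of the original answer.

```python
import numpy as np


def equirectangular(lat1, lon1, lat2, lon2, earth_radius=6371):
    """Approximate distance in meters (hypothetical helper, not from the answer)."""
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    # Scale the longitude difference by the cosine of the mean latitude.
    x = (lon2 - lon1) * np.cos((lat1 + lat2) / 2.0)
    y = lat2 - lat1
    km = earth_radius * np.sqrt(x**2 + y**2)
    return km * 1000


def dis_approx(t0, t1):
    # Same interface as dis: takes two itertuples rows.
    return equirectangular(t0.latitude, t0.longitude, t1.latitude, t1.longitude)
```

For the first two points in the question this should agree with the haversine value of roughly 230.3 m to well within a meter. The approximation degrades over long distances and near the poles, so haversine remains the safer default.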

The Loop

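def f(d, threshold=50):
    # Walk the rows in order, comparing each row to the last kept row.
    itups = d.itertuples()

    # The first row is always kept and serves as the initial reference.
    last = next(itups)

    indices = [last.Index]
    distances = [0]

    for tup in itups:
        distance = dis(tup, last)
        # Keep the row only if it is far enough from the last kept row.
        if distance > threshold:
            indices.append(tup.Index)
            distances.append(distance)
            last = tup

    return indices, distances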

The Results

idx, distances = f(df1, 150)
df1.loc[idx].assign(distance=distances)

   id  longitude   latitude    distance
0   1   4.929783  52.372250    0.000000
1   2   4.932333  52.370884  230.305288
3   4   4.933900  52.372234  183.986479
4   5   4.928467  52.375282  500.896578
5   6   4.924583  52.375950  273.918990
6   7   4.922133  52.376301  170.828592
8   9   4.920967  52.374481  217.302775
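
Note that the loop above keeps the last point here only because it happens to be more than 150 m from the previously kept point. If the last point must always be kept, as the question asks, one possible variation is sketched below. The function name `thin_keep_last` is hypothetical, and this is only one way to interpret the requirement.

```python
import numpy as np
import pandas as pd


def haversine(lat1, lon1, lat2, lon2, earth_radius=6371):
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return earth_radius * 2 * np.arcsin(np.sqrt(a)) * 1000  # meters


def thin_keep_last(d, threshold=150):
    """Greedy thinning that always keeps the first and the last row."""
    rows = list(d.itertuples())
    if not rows:
        return [], []

    last = rows[0]
    indices, distances = [last.Index], [0.0]
    final_pos = len(rows) - 1

    for pos, tup in enumerate(rows[1:], start=1):
        distance = float(haversine(last.latitude, last.longitude,
                                   tup.latitude, tup.longitude))
        # Keep the row if it is far enough away, or if it is the final row.
        if distance > threshold or pos == final_pos:
            indices.append(tup.Index)
            distances.append(distance)
            last = tup

    return indices, distances


d1 = {'id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
      'longitude': [4.929783, 4.932333, 4.933950, 4.933900, 4.928467,
                    4.924583, 4.922133, 4.921400, 4.920967],
      'latitude': [52.372250, 52.370884, 52.371101, 52.372234, 52.375282,
                   52.375950, 52.376301, 52.376232, 52.374481]}
df1 = pd.DataFrame(data=d1)

idx, distances = thin_keep_last(df1, 150)
result = df1.loc[idx].assign(distance=distances)
```

On the full example this selects the same rows as `f`. The difference shows up on a trajectory such as `df1.head(8)`, where the final point (id 8) is only about 50 m from id 7 and would otherwise be dropped.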

Upvotes: 1
