karsmithine
karsmithine

Reputation: 11

How do I find the closest point for each point in a data set while keeping precision?

This is my data set: https://pastebin.com/SsuKP2eH

I'm trying to find the nearest point for all points in the data set. These points are latitude and longitude on the Earth's surface. Of course, the nearest point cannot be the same point.

I tried the KDTree solutions listed in this post: https://stackoverflow.com/a/45128643 and changed the poster's random points (generated by np.random.uniform) to my own data set.

I expected to get an array full of distances, but instead, I got an array full of zeroes with some numbers like 2.87722e-06 and 0.616582 sprinkled in. This wasn't what I wanted. I tried the other solution, NearestNeighbours, on my data set and got the same result. So, I did some debugging and reduced the range of random numbers he used, making it closer to my own data set.

import numpy as np
import scipy.spatial as spatial
import pandas as pd

R = 6367

def using_kdtree(data):
    "Based on https://stackoverflow.com/q/43020919/190597"
    def dist_to_arclength(chord_length):
        """
        https://en.wikipedia.org/wiki/Great-circle_distance
        Convert Euclidean chord length to great circle arc length
        """
        central_angle = 2*np.arcsin(chord_length/(2.0*R)) 
        arclength = R*central_angle
        return arclength

    phi = np.deg2rad(data['Latitude'])
    theta = np.deg2rad(data['Longitude'])
    data['x'] = R * np.cos(phi) * np.cos(theta)
    data['y'] = R * np.cos(phi) * np.sin(theta)
    data['z'] = R * np.sin(phi)
    tree = spatial.KDTree(data[['x', 'y','z']])
    distance, index = tree.query(data[['x', 'y','z']], k=2)
    return dist_to_arclength(distance[:, 1])
    #return distance, index


np.random.seed(2017)
N = 1000
#data = pd.DataFrame({'Latitude':np.random.uniform(-90,90,size=N), 'Longitude':np.random.uniform(0, 360,size=N)})
data = pd.DataFrame({'Latitude':np.random.uniform(-49.19,49.32,size=N), 'Longitude':np.random.uniform(-123.02, -123.23,size=N)})

result = using_kdtree(data)

I found that the resulting distances array had small values, close to 0. This makes me believe that the reason why the result array for my data set is full of zeroes is because the differences between points are very small. Somewhere, the KD Tree/nearest neighbours loses precision and outputs garbage. Is there a way to make them keep the precision of my floats? The brute-force method can keep precision but it is far too slow with 7200 points to iterate through.

Upvotes: 1

Views: 911

Answers (1)

Lukas S
Lukas S

Reputation: 3603

I think what's happening is that k=2 in

    distance, index = tree.query(data[['x', 'y','z']], k=2)

tells KDTree you want the closest two points to a point. So the closest is obviously the point itself with distance from itself being zero. Also if you print index you see a Nx2 array and each row starts with the number of the row. This is KDTree's way of saying well the closest point to the i-th point is the i-th point itself.

Obviously that is not useful and you probably want only the 2nd closest point. Fortunately I found this in the documentation of the k parameter of query

Either the number of nearest neighbors to return, or a list of the k-th nearest neighbors to return, starting from 1. https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.query.html#scipy.spatial.KDTree.query

So

    distance, index = tree.query(data[['x', 'y','z']], k=[2])

gives only the distance and index of the 2nd to closest point.

Upvotes: 1

Related Questions