Python: sklearn.neighbors.KDTree not working as expected

Question

I am writing a program that should select the points located in the neighborhood of another point. The neighborhood size is specified by a radius. I am using sklearn.neighbors.KDTree for this, but it is not working as I expected.

To show you what I am dealing with, I have got two data frames:

df_example_points, which is a set of points I want to search in,

>>> import pandas as pd
>>> from sklearn.neighbors import KDTree
>>> df_example_points = pd.DataFrame(
...     {
...         'X': [-845.204, -845.262, -845.262, -845.262],
...         'Y': [-1986.243, -1986.077, -1986.077, -1986.079],
...         'Z': [246.655, 246.741, 246.742, 246.743],
...     }
... )
>>> print(df_example_points)
         X         Y        Z
0 -845.204 -1986.243  246.655
1 -845.262 -1986.077  246.741
2 -845.262 -1986.077  246.742
3 -845.262 -1986.079  246.743

and df_reference_point, which consists of a single point whose neighbourhood I want to search.

>>> df_reference_point = pd.DataFrame({'X': [-845.002], 'Y': [-1986.32], 'Z': [246.508]})
>>> print(df_reference_point)
         X        Y        Z
0 -845.002 -1986.32  246.508

When I hardcode what I expect from KDTree, it seems that every point in df_example_points should be extracted, because each one lies inside the reference point's neighbourhood.

>>> radius = 0.27
>>> x_ref, y_ref, z_ref = df_reference_point.iloc[0]
>>> x_min, x_max = x_ref - radius, x_ref + radius
>>> y_min, y_max = y_ref - radius, y_ref + radius
>>> z_min, z_max = z_ref - radius, z_ref + radius
>>> for i, (x, y, z) in df_example_points.iterrows():
...     if all([x_min <= x <= x_max, y_min <= y <= y_max, z_min <= z <= z_max]):
...         print(f'Point {i} SHOULD be extracted.')
...     else:
...         print(f'Point {i} SHOULD NOT be extracted.')
Point 0 SHOULD be extracted.
Point 1 SHOULD be extracted.
Point 2 SHOULD be extracted.
Point 3 SHOULD be extracted.

However, when I try to use KDTree, only one point is extracted.

>>> tree = KDTree(df_example_points.values)
>>> extracted_points_indices = tree.query_radius(df_reference_point.values.reshape(1, -1), radius)[0]
>>> print(f'Number of extracted points: {len(extracted_points_indices)}')
Number of extracted points: 1

I want to use KDTree because its implementation is much faster. However, I cannot use it if the result is not reliable. Could you please help me find what I am doing wrong? What am I missing?

FBruzzesi · Accepted Answer

As @Gabriel commented, you are using two different distance metrics. The KDTree default is minkowski (with p=2, which is the Euclidean distance), while your manual check compares each coordinate separately, which corresponds to chebyshev (the metrics sklearn supports are listed in the DistanceMetric documentation).

Changing the metric gives your expected result:

tree = KDTree(df_example_points.values, metric='chebyshev')
extracted_points_indices = tree.query_radius(df_reference_point.values.reshape(1, -1), radius)[0]

print(f'Number of extracted points: {len(extracted_points_indices)}')
# Number of extracted points: 4
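
To see why the two metrics disagree on this data, here is a minimal NumPy sketch (not part of the original answer) that computes both distances by hand for the points in the question. The variable names are only illustrative. The Euclidean radius describes a sphere around the reference point, while the per-axis check in the question describes an axis-aligned cube, which is what the Chebyshev distance measures.

```python
import numpy as np

# Data copied from the question.
points = np.array([
    [-845.204, -1986.243, 246.655],
    [-845.262, -1986.077, 246.741],
    [-845.262, -1986.077, 246.742],
    [-845.262, -1986.079, 246.743],
])
reference = np.array([-845.002, -1986.32, 246.508])
radius = 0.27

# Absolute per-axis differences between each point and the reference point.
diff = np.abs(points - reference)

# Euclidean distance (Minkowski with p=2): neighbourhood is a sphere.
euclidean = np.sqrt((diff ** 2).sum(axis=1))

# Chebyshev distance: largest per-axis difference, neighbourhood is a cube.
# This is the same test as the x_min/x_max, y_min/y_max, z_min/z_max check.
chebyshev = diff.max(axis=1)

inside_sphere = np.flatnonzero(euclidean <= radius)
inside_cube = np.flatnonzero(chebyshev <= radius)

print('Euclidean distances:', np.round(euclidean, 3))
print('Chebyshev distances:', np.round(chebyshev, 3))
print('Inside sphere:', inside_sphere)
print('Inside cube:  ', inside_cube)
```

Only point 0 has a Euclidean distance below 0.27 (about 0.26); the other three are roughly 0.43 away even though none of their coordinates differs from the reference by more than 0.27. All four points pass the Chebyshev test, which matches the result of the hardcoded check.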
