Find continuous groups of numbers in a numpy 2D array allowing a max. distance on two independent values in Python

Question

I have a very big matrix where I need to cluster the data points based on two criteria:

they should not be further way from each other than 3 on the x axis to form a cluster
they should also not be further way from each other than 3 on the y axis to form a cluster.

For example the following data points in the matrix:

[14, 282681]
[14, 282680]
[21, 176161]
[22, 176162]
[37, 273403]
[37, 273443]
[41, 207638]

They should be grouped into:

{1: [[14, 282681][14, 282680]],
2: [[21, 176161],[22, 176162]],
3: [[37, 273403],
4: [[37, 273443]]],

Doing only 1) on a 1D array is trivial since one can sort the array and then just insert a break every time the gap is bigger than 3. I have tried this with list comprehension so far. But combining both criteria on both axis at the same time is really making my head twist a bit. If I sort it first according to the x-axis, then chop it according to gaps bigger than 3 and repeat the procedure after that on the y-axis, the data gets messed up. I have tried to insert the data points into a matrix and apply connected component labelling on this. It works, but it is slow as hell. I wonder if there is any faster, more elegant way to approach this problem?

Igor Rivin · Accepted Answer

The first problem is that there is no obvious unique solution to your problem - there are many ways to partition the data to satisfy your constraint. That aside, I think what you really want is to generate a K-d tree for your data (in your case, K=2), and it will do what you need. This is available in scipy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html and the query-ball-point method is most closely aligned to what you want.

Find continuous groups of numbers in a numpy 2D array allowing a max. distance on two independent values in Python

Answers (1)

Related Questions