Reputation: 345
I have a very big matrix where I need to cluster the data points based on two criteria:
For example the following data points in the matrix:
[14, 282681]
[14, 282680]
[21, 176161]
[22, 176162]
[37, 273403]
[37, 273443]
[41, 207638]
They should be grouped into:
{1: [[14, 282681][14, 282680]],
2: [[21, 176161],[22, 176162]],
3: [[37, 273403],
4: [[37, 273443]]],
Doing only 1) on a 1D array is trivial since one can sort the array and then just insert a break every time the gap is bigger than 3. I have tried this with list comprehension so far. But combining both criteria on both axis at the same time is really making my head twist a bit. If I sort it first according to the x-axis, then chop it according to gaps bigger than 3 and repeat the procedure after that on the y-axis, the data gets messed up. I have tried to insert the data points into a matrix and apply connected component labelling on this. It works, but it is slow as hell. I wonder if there is any faster, more elegant way to approach this problem?
Upvotes: 1
Views: 310
Reputation: 4864
The first problem is that there is no obvious unique solution to your problem - there are many ways to partition the data to satisfy your constraint. That aside, I think what you really want is to generate a K-d tree for your data (in your case, K=2), and it will do what you need. This is available in scipy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html and the query-ball-point
method is most closely aligned to what you want.
Upvotes: 2