Reputation: 2203
I have many points (latitudes and longitudes) on a plane (a city) and I want to find two clusters. Cluster 1 is points cluttered close together and Cluster 2 is everything else.
I know the definition of the problem is not exact. The only thing defined is that I need exactly 2 clusters. Out of N points, how many end up in cluster 1 or cluster 2 is not defined.
The main aim is to identify points which are very close to each other and separate them from the rest (which are more more evenly spread out)
The best I can think of is the following algorithm:
1. For each point, Calculate the sum of the square distances to all other points.
2. Run the k-means with k=2 on these square distances
The squaring (or maybe even higher order) of the distance should help by raising the dimensionality. However this algorithm will be biased towards points near the center of the city. It will struggle to find clusters at the edges of the city.
Any suggestions on how to avoid this problem? And any other suggestions to improve this algorithm
Upvotes: 0
Views: 1156
Reputation: 156
Since points in cluster 1 are close to each other, I think a density-based clustering algorithm may help. You may try the OPTICS algorithm, which is similar to DBSCAN but is aware of varying density and the number of clusters can be specified by the user.
Upvotes: 1
Reputation: 17238
i'd suggest something along the following lines:
key concept
count number of neighbouring points at distance less than a given value.
semiformal description
nc(P)
of neighbouring points at distance less than a given value d_cutoff
for each point P
.P_i
with nc(P_i)
greater than a threshold thres_count
into cluster #1.P_i
in cluster #1 add its close neighbors, i.e. points Q
with d(Q, P_i) < d_cutoff
to the very same cluster #1.algorithmic angle
G=(V, E)
with your points being the vertex set V
and an edge between every pair of points at a distance less than d_cutoff
from each other.e=(v,w)
from the graph where deg(v) < thres_count
and deg(w) < thres_count
.G
's isolated vertices form cluster #2, the complement is cluster #1.heuristic on how to choose d_cutoff
build a minimum spanning tree (mst) of your point set. the frequency distribution of edge lengths should hint at suitable cutoff values. short pairwise distances will be incorporated into the mst first. thus there should be at least one pronounced gap in the ordered sequence of edge lengths for point sets with a natural clustering. so partition the set of mst edge lengths into a small number of adjacent intervals, ordering these intervals in the natural way. count how many actual distance values fall into each interval. consider the map between an interval's ordinal number and its count of distance values. large deltas between functions values for successive arguments would suggest to take the upper bound of distances in the lower interval as d_cutoff
.
Upvotes: 1