Reputation: 683
I am working with centered longitude (x) and latitude (y) data. My goal is to cluster the connected locations.
Two locations on Earth, (x1,y1) and (x2,y2), are said to be connected if earth_distance((x1,y1),(x2,y2)) < 15 kilometers.
I am using the distHaversine function (from the geosphere package) in R to calculate the earth distance.
Here is some sample data,
x=c(1.000000, 1.055672, 1.038712, 1.094459, 1.133179, 1.116241, 1.126053, 1.181824 ,1.377892, 5.869881, 5.925270, 5.909721)
and
y=c(1.333368,1.304790,1.347332,1.318743,1.332676,1.375229,1.572287,1.544174,2.371105,2.337032,2.383415)
also
distance <- distHaversine(c(x,y))
I wish to find the clusters formed by the connected sets of points (each connected set of points forms a cluster).
I looked at "How to cluster points and plot", but I could not solve my problem.
Any reference, suggestion or answer will be very much appreciated.
Upvotes: 0
Views: 1182
Reputation: 94182
Maybe this. First make some coordinates:
> x=c(1.000000, 1.055672, 1.038712, 1.094459, 1.133179, 1.116241, 1.126053, 1.181824 ,1.377892, 5.869881, 5.925270)
> y=c(1.333368, 1.304790, 1.347332, 1.318743, 1.332676, 1.375229, 1.572287, 1.544174, 2.371105 ,2.337032, 2.383415)
Make them into a data frame:
> xy = data.frame(x=x,y=y)
Now use outer to loop over all pairs of rows and columns to compute a full distance matrix. This does twice as much work as is really necessary since it computes i to j and j to i for all i and j. Anyway, it gets us a distance matrix:
> require(geosphere)
> dmat = outer(1:nrow(xy), 1:nrow(xy), function(i,j)distHaversine(xy[i,],xy[j,]))
Now we want a connectivity matrix, which is any pair closer than 15,000 metres:
> cmat = dmat < 15000
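If geosphere is not available, the same two matrices can be sketched in base R. This is only an illustration, not how distHaversine is implemented: haversine_matrix is a hypothetical helper written for this example, and the radius of 6378137 metres is assumed to match the default used by distHaversine.

```r
# Coordinates from the answer (longitude x, latitude y, in degrees).
x <- c(1.000000, 1.055672, 1.038712, 1.094459, 1.133179, 1.116241,
       1.126053, 1.181824, 1.377892, 5.869881, 5.925270)
y <- c(1.333368, 1.304790, 1.347332, 1.318743, 1.332676, 1.375229,
       1.572287, 1.544174, 2.371105, 2.337032, 2.383415)

# Hypothetical helper (not part of geosphere): all pairwise great-circle
# distances in metres via the haversine formula.
# r is the assumed sphere radius in metres.
haversine_matrix <- function(lon, lat, r = 6378137) {
  lon <- lon * pi / 180
  lat <- lat * pi / 180
  dlon <- outer(lon, lon, "-")   # lon[i] - lon[j] for every pair
  dlat <- outer(lat, lat, "-")   # lat[i] - lat[j] for every pair
  a <- sin(dlat / 2)^2 + outer(cos(lat), cos(lat)) * sin(dlon / 2)^2
  # Clamp to 1 to guard against rounding before asin();
  # the matrix goes first so pmin() keeps its dimensions.
  2 * r * asin(pmin(sqrt(a), 1))
}

dmat <- haversine_matrix(x, y)
cmat <- dmat < 15000   # TRUE where two points are within 15 km
```

Because everything is computed with vectorised matrix operations, there is no per-pair function call, although the full symmetric matrix is still stored.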
Now we use the igraph package to build a connectivity graph object:
> require(igraph)
> cgraph = graph.adjacency(cmat)
You can plot this to see the cluster formation, but note these are not plotted in your x-y space:
> plot(cgraph)
Now to get the connected clusters:
> clusters(cgraph)
$membership
[1] 1 1 1 1 1 1 2 2 3 4 4
$csize
[1] 6 2 1 2
$no
[1] 4
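To make it clearer what the membership vector means, here is a minimal base-R sketch of connected-component labelling by breadth-first search. It is not igraph's implementation; connected_components is a hypothetical helper, and the small adjacency matrix is a toy input made up for illustration.

```r
# Hypothetical helper: label the connected components of a symmetric
# logical adjacency matrix. Labels are numbered in order of first appearance.
connected_components <- function(adj) {
  n <- nrow(adj)
  membership <- integer(n)          # 0 means "not visited yet"
  label <- 0L
  for (start in seq_len(n)) {
    if (membership[start] != 0L) next
    label <- label + 1L             # start a new component
    membership[start] <- label
    queue <- start
    while (length(queue) > 0) {
      node <- queue[1]
      queue <- queue[-1]
      # Unvisited neighbours of the current node join the same component.
      nbrs <- which(adj[node, ] & membership == 0L)
      membership[nbrs] <- label
      queue <- c(queue, nbrs)
    }
  }
  membership
}

# Toy input: points 1-2 are linked, 3-4 are linked, 5 is isolated.
adj <- matrix(FALSE, 5, 5)
adj[1, 2] <- adj[2, 1] <- TRUE
adj[3, 4] <- adj[4, 3] <- TRUE
diag(adj) <- TRUE                   # each point is within 0 m of itself
toy_membership <- connected_components(adj)   # 1 1 2 2 3
```

Calling connected_components(cmat) on the connectivity matrix from above should give the same grouping as the $membership shown. In recent igraph releases, graph.adjacency and clusters are also available under the names graph_from_adjacency_matrix and components.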
Which you can add to your data frame thus:
> xy$cluster = clusters(cgraph)$membership
> xy
          x        y cluster
1  1.000000 1.333368       1
2  1.055672 1.304790       1
3  1.038712 1.347332       1
4  1.094459 1.318743       1
5  1.133179 1.332676       1
6  1.116241 1.375229       1
7  1.126053 1.572287       2
8  1.181824 1.544174       2
9  1.377892 2.371105       3
10 5.869881 2.337032       4
11 5.925270 2.383415       4
And plot:
> plot(xy$x,xy$y,col=xy$cluster)
Upvotes: 2