Reputation: 113
I'm working with a large dataset of spatial parcels, where each row contains geographic coordinates (UTM), parcel area and value:
[x, y, area, value]:
[272564.9434265977, 6134243.108910706, 980.63, 550.6664083293393],
[272553.9611341293, 6134209.499155387, 1026.55, 477.32696897374706],
[271292.4197118982, 6132982.047648986, 634.438, 851.1469993915875],
...
Plotting these reveals several distinct zones where dollar value varies with geography (the high-value strip on the left of the plot, for example, is coastal).
I would like to identify clusters of value (i.e. the coastal strip) and have looked at several approaches:
K-means seems the easiest clustering method to implement, but appears unsuitable because it only considers the distance between points and ignores any further attributes.
ClusterPy looks ideal for this application, but its documentation only seems to cover working with GIS files.
DBSCAN seems more relevant, but I'm not sure how I can include my additional attribute ($ value) - perhaps as a third dimension?
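A rough sketch of what I mean by a third dimension (the StandardScaler rescaling and the eps/min_samples values here are pure guesses on my part):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

data = np.array([
    [272564.9434265977, 6134243.108910706, 980.63, 550.6664083293393],
    [272553.9611341293, 6134209.499155387, 1026.55, 477.32696897374706],
    [271292.4197118982, 6132982.047648986, 634.438, 851.1469993915875],
    # ... rest of the parcels
])

xy = data[:, :2]       # UTM coordinates in metres
value = data[:, 3:4]   # dollar value as a column vector

# x/y are metres and value is dollars, so the columns have to be rescaled
# before one Euclidean distance over all three makes sense for DBSCAN.
features = np.hstack([StandardScaler().fit_transform(xy),
                      StandardScaler().fit_transform(value)])

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(features)  # guessed parameters
```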
Can anybody suggest any other toolkits/approaches to consider?
Upvotes: 3
Views: 1725
Reputation: 125
What about creating price contours (like the contours on a geological map)? Instead of a contour connecting points of similar elevation, the contours would connect points of similar price.
You'd get a map of parcels that are "clustered" according to contour intervals (price values), with the contour boundaries defining zones that reflect certain price characteristics.
You could then extract the parcels that lie within each price band (contour interval) and assign them a particular cluster number. Doing this for all the parcels would give you spatially connected "clusters" of prices that reflect the observed data, without needing to rely on complex ML clustering algorithms that never seem to get things right.
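A minimal sketch of the idea, assuming the parcels sit in a NumPy array with the [x, y, area, value] layout from the question; the five price bands and the parcels.csv filename are placeholders:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical file holding the [x, y, area, value] rows from the question.
data = np.loadtxt("parcels.csv", delimiter=",")
x, y, values = data[:, 0], data[:, 1], data[:, 3]

# Price bands play the role of contour intervals; five bands is an arbitrary choice.
band_edges = np.linspace(values.min(), values.max(), num=6)

# Draw filled price contours over the parcel centroids.
plt.tricontourf(x, y, values, levels=band_edges, cmap="viridis")
plt.colorbar(label="parcel value ($)")
plt.scatter(x, y, s=2, c="k")
plt.show()

# Assign each parcel the index of the price band (contour interval) it falls in.
band_labels = np.digitize(values, band_edges[1:-1])
```

Parcels sharing a band label that also form one spatially connected patch would be your clusters; a connected-components pass over neighbouring parcels within each band would split a band into its separate geographic zones.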
Upvotes: 0
Reputation: 285
At least in hierarchical clustering you can define connectivity constraints so that only "connected" samples can belong to the same cluster. In your case, x and y would be used by sklearn.neighbors.kneighbors_graph() to build the list of neighbours, and the value variable would be used in the clustering itself.
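Something along these lines, assuming the parcels are in a NumPy array with the [x, y, area, value] layout from the question; the neighbour count, the number of clusters and the parcels.csv filename are assumptions to tune:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import AgglomerativeClustering

# Hypothetical file holding the [x, y, area, value] rows from the question.
data = np.loadtxt("parcels.csv", delimiter=",")
xy = data[:, :2]       # spatial coordinates (metres)
value = data[:, 3:4]   # dollar value as a column vector

# Connectivity graph built from x/y only: each parcel is linked to its
# k nearest spatial neighbours.
connectivity = kneighbors_graph(xy, n_neighbors=10, include_self=False)

# Cluster on value, but only allow merges between spatially connected parcels.
model = AgglomerativeClustering(
    n_clusters=5,               # assumed number of zones
    connectivity=connectivity,
    linkage="ward",
)
labels = model.fit_predict(value)
```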
Upvotes: 2
Reputation: 77454
Look at generalized DBSCAN (GDBSCAN), which easily allows you to require neighbor points to both be within a chosen spatial distance and have a similar value before they count as neighbors.
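scikit-learn does not ship GDBSCAN itself, but you can approximate the combined neighbourhood predicate by handing plain DBSCAN a precomputed distance matrix in which parcel pairs that differ too much in value are pushed out of reach. The eps, min_samples and value tolerance below are assumptions, and the dense NxN matrix only suits a moderate number of parcels:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import DBSCAN

# Hypothetical file holding the [x, y, area, value] rows from the question.
data = np.loadtxt("parcels.csv", delimiter=",")
xy = data[:, :2]
value = data[:, 3]

spatial_dist = cdist(xy, xy)                           # metres between parcels
value_diff = np.abs(value[:, None] - value[None, :])   # dollar difference

# Neighbourhood predicate: within eps metres AND within value_tol dollars.
# The value condition is encoded by pushing the distance far beyond eps.
value_tol = 100.0    # assumed dollar tolerance
far = 1e12           # large finite sentinel so sklearn's input checks pass
dist = np.where(value_diff < value_tol, spatial_dist, far)

labels = DBSCAN(eps=250.0, min_samples=5, metric="precomputed").fit_predict(dist)
```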
Upvotes: 2